Chapter 1
Introduction 


In  many  engineering  applications  we  choose  to  view  the  world  (an  unknown  signal)  through  a 
set  of  samples.  Often,  a  relatively  small  number  of  samples  tells  us  all  we  need  to  know.  For 
example,  the  classical  Whittaker-Nyquist-Kotelnikov-Shannon  sampling  theorem  states  that 
a  continuous-time  band-limited  signal  can  be  perfectly  reconstructed  from  uniformly  spaced 
discrete  samples  provided  that  the  sampling  rate  (number  of  samples  per  time)  is  greater 
than  twice  the  signal  bandwidth.  This  fact  is  crucial  to  the  analog-to-digital  conversion  in 
signal  processing  and  telecommunications. 

For  a  more  general  notion  of  what  it  means  to  sample  a  signal,  we  may  consider  a  variety 
of  interesting  applications  where  the  signals  of  interest  are  not  band-limited.  In  fact,  even 
more  can  be  said  when  we  consider  that  sometimes  the  information  we  desire  is  not  an 
unknown  signal  per  se,  but  rather  some  function  of  it.  Examples  from  the  past  decade 
include spectrum blind sampling (Bresler et al. [1, 2, 3]), sampling with a finite rate of
innovation (Vetterli et al. [4]), and compressed sensing (Donoho [5], Candes & Tao [6],
and many others).

In particular, the field of compressed sensing deals with the digital-to-digital sampling
of signals that are somehow compressible. For many such signals, the sampling process
simultaneously senses (provides a set of samples sufficient to reconstruct an unknown signal)
and compresses (the number of samples is far less than the dimension of the original signal).

For  a  given  sampling  application,  natural  questions  include:  1)  how  should  the  samples 
be  taken?  2)  how  many  samples  are  required?  3)  what  fidelity  is  required  to  represent  each 
sample?  and  4)  how  do  we  extract  the  desired  information  from  the  samples? 


1.1  Our  Contributions 

In  this  thesis,  we  consider  the  sampling  of  discrete-time  signals  that  are  sparse,  i.e.  most  of 
the  elements  are  zero.  Such  signals  contain  both  discrete  information  (the  support,  i.e.  the 
indices  of  the  non-zero  components)  as  well  as  non-discrete  information  (the  values  of  the 
non-zero  components).  In  applications  such  as  signal  estimation  and  compression  the  goal  is 
to  recover  the  original  signal  under  some  mean  squared  error  (MSE)  distortion  criterion.  In 
applications such as model selection and regression the main concern is to correctly identify
the  support. 

We  focus  on  a  particular  form  of  sampling,  namely  noisy  linear  projections,  and  address 
questions  about  the  number  and  quality  of  samples  required.  The  fidelity  of  each  sample  is 
measured  by  the  per-sample  signal-to-noise  ratio  (SNR),  and  we  assume  that  the  size  of  the 
support  scales  linearly,  as  opposed  to  sub-linearly,  with  the  dimension  of  the  unknown  signal. 
We  make  the  following  contributions  for  the  under-sampled  large  system  setting  where  the 
number  of  samples  is  less  than  the  signal  dimension,  and  the  signal  dimension  becomes  very 
large: 

•  Perfect  support  recovery  is  hard:  If  the  per-sample  SNR  does  not  increase  with 
the  dimension  of  the  signal,  then  exact  recovery  of  the  support  is  not  possible.  Previous 
work  has  shown  this  to  be  the  case  for  particular  efficient  sub-optimal  reconstruction 
algorithms.  In  Theorem  3.1  of  Chapter  3,  we  show  that  it  is  also  true  for  any  possible 
reconstruction  algorithm. 

• Fractional support recovery is not as hard: We introduce a notion of partial
support recovery and show that even if the per-sample SNR does not increase with the
dimension of the signal, it is still possible to guarantee recovery of some fraction of the
support. Chapter 2 describes our fractional distortion metric. Theorems 3.2 and 3.4 in
Chapter 3 give necessary conditions on the sampling rate and SNR required to recover
a given fraction of the support. Equivalently, these results may be seen as upper bounds
on the fraction that can be recovered using any possible estimator. Theorem 4.1 in
Chapter 4 gives a complementary set of sufficient conditions for an ideal estimator that
uses exhaustive search.

•  Stochastic  versus  worst-case  analysis:  Previous  work  on  perfect  support  recovery 
has used a worst-case analysis that requires a (lower) bound on the smallest non-zero
signal component. In this thesis, we consider both stochastic and non-stochastic
signal models. An advantage of considering signals with a distribution is that we
can give performance guarantees even when the smallest non-zero signal component is
arbitrarily small. The nature of these guarantees is made precise in Chapter 2 where we
define a notion of the SNR and reliable recovery for both stochastic and non-stochastic
signal  classes.  The  results  on  partial  support  recovery  in  Chapters  3  and  4  depend  only 
on  certain  properties  of  the  assumed  signal  class. 

•  Not  all  noise  is  the  same:  It  is  tempting  to  argue  that  noise  added  to  the  signal  prior 
to  sampling  can  be  “pushed  through”  the  sampling  process  and  equivalently  considered 
as  noise  added  after  the  sampling  process.  We  show  that  such  a  consideration  is 
inappropriate. Chapter 5 addresses the task of signal estimation under two different
noise  models  and  Theorem  5.1  gives  the  corresponding  asymptotic  MSE  distortions  as 
a  function  of  the  sampling  rate  and  SNR. 

We  remark  that  our  results  represent  fundamental  (information-theoretic)  limits.  For 
the  task  of  support  recovery,  our  necessary  bounds  hold  for  any  possible  estimator.  At  the 
same time, our achievable results correspond to an ideal estimator that performs exhaustive
search. Such an estimator is computationally expensive, and thus an interesting extension of
this thesis is to find corresponding achievable results for an efficient estimator.

Also, we emphasize that Chapters 3 and 4 are concerned with sampling bounds for support
recovery  whereas  Chapter  5  compares  different  noise  models  on  the  task  of  signal  estimation. 
Since  the  tasks  of  support  recovery  and  signal  estimation  are  related,  a  natural  question  not 
addressed  in  this  thesis  is  what  effect  noise  added  prior  to  sampling  has  on  the  task  of  support 
recovery. 

The remainder of this introduction is organized as follows: Section 1.2 describes our general
sampling model, Section 1.3 provides an example application in sensor networks, and
Section 1.4 gives a brief overview of previous work in this area.


1.2  General  Sampling  Model 

Let $x \in \mathbb{R}^n$ be an unknown sparse signal. The observation model can be generally formulated
as a sampling problem where each “sample” consists of an inner product between $x$ and some
predetermined measurement vector $\phi_i \in \mathbb{R}^n$ plus some random noise $w_i$:

$$y_i = \langle \phi_i, x \rangle + w_i \quad \text{for } i = 1, \dots, m. \tag{1.1}$$

We collect our observations into a vector $y = (y_1, y_2, \dots, y_m)^T$ and let $w = (w_1, w_2, \dots, w_m)^T$.
Then, the sampling is given in matrix form as

$$y = \Phi x + w, \tag{1.2}$$

where $\Phi = [\phi_1, \phi_2, \dots, \phi_m]^T$ is an $m \times n$ sampling matrix. In the under-sampled setting
($m < n$) the matrix $\Phi$ is not invertible, and thus general inference problems are challenging.


The locations of the non-zero elements of $x$ will be referred to as the support of $x$:

$$K = \{ i \in \{1, \dots, n\} : x_i \neq 0 \}, \tag{1.3}$$

where $k = |K|$ is the number of non-zero elements. The signal $x$ is sparse if $k$ is significantly
less than $n$. We use $\Phi_K$ to denote the matrix formed by the columns of $\Phi$ indexed by $K$ and
$x_K$ to denote the vector formed by the elements of $x$ indexed by $K$. Also, $K^c$ denotes the
complement of the support and thus $|K^c| = n - k$. If $k < m$, then the submatrix $\Phi_K$ is
left-invertible (it has full column rank for generic $\Phi$). This is significant because it means
that if we know the support $K$ then our
observation model corresponds to an over-constrained set of linear equations. In general,
however, the support is considered to be unknown.
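
As a small illustration of this notation, the following Python sketch extracts the support of a vector, its complement, and the corresponding column submatrix and sub-vector. The function names (`support`, `columns`) and the toy values are assumptions made for this example only, and indices are 0-based here whereas the text uses 1-based indices.

```python
def support(x):
    """Return the sorted indices of the non-zero entries of x (0-based)."""
    return [i for i, xi in enumerate(x) if xi != 0]


def columns(Phi, K):
    """Return the submatrix of Phi (a list of rows) formed by the columns in K."""
    return [[row[j] for j in K] for row in Phi]


# Toy example with n = 5, k = 2, m = 3 (values chosen arbitrarily).
x = [0.0, 1.5, 0.0, 0.0, -2.0]
Phi = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [0.0, 1.0, 0.0, 1.0, 0.0],
    [2.0, 0.0, 1.0, 0.0, 1.0],
]

K = support(x)                                   # indices of the non-zero entries
K_c = [i for i in range(len(x)) if i not in K]   # complement of the support
Phi_K = columns(Phi, K)                          # m x k submatrix of Phi
x_K = [x[i] for i in K]                          # values of the non-zero entries

# Without noise, Phi x depends only on the support columns: Phi x = Phi_K x_K.
y_full = [sum(p * xi for p, xi in zip(row, x)) for row in Phi]
y_sub = [sum(p * xi for p, xi in zip(row, x_K)) for row in Phi_K]
```

When the support is known, only the $k$ columns in `Phi_K` matter, which is why the $m$ observations then form an over-constrained system for the $k$ unknown values.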


Figure 1.2: The sparse under-sampled setting when the support $K$ corresponds to the first
$k$ elements of $x$ (gray represents zero value).


1.3 A Sensor Network Application

To  motivate  the  problem  of  sparse  support  recovery  with  finite  (non-increasing)  SNR  we 
provide  the  following  sensor  network  example  (variations  can  be  found  in  [7,  8,  9]). 

Imagine that there is some sparse phenomenon in our environment which may be modeled
as a sparse vector $x \in \mathbb{R}^n$ whose indices correspond to specific locations. We desire to locate
the phenomenon (i.e. identify the support of $x$) using a spatially distributed network of $n$
sensors. If the placement of the sensors corresponds to the locations indexed by $x$, then the
vector of sensor observations $\tilde{x} \in \mathbb{R}^n$ will also be sparse. More generally, however, we may
assume that there is some known one-to-one linear transformation between $x$ and the sensor
observations (represented by $\Psi \in \mathbb{R}^{n \times n}$), and so the observations $\tilde{x} = \Psi x$ are in general
non-sparse.

To determine the support of $x$, it is clearly sufficient to collect the observations from all
$n$ sensors. However, since power and bandwidth are likely to be scarce resources, it may be
a much better idea to use one of the procedures outlined in [7, 8, 9] to efficiently compute
$m$ linear projections of the data $\tilde{x}$. Then, the data we receive from the network is of the
form $y = \Phi \Psi x + w$ where the noise $w$ results from observation noise at each sensor, and
computation and communication across the network.

Under  assumptions  about  x,  we  may  pose  recovery  in  terms  of  the  SNR.  As  the  problem 
size  n  increases  it  is  important  that  the  size  (in  bits)  of  each  sample  remains  constant. 
Otherwise,  the  communication  will  ultimately  overwhelm  the  network  even  if  the  ratio  m/n 
stays  fixed.  Our  results  are  significant  because  they  show  that  such  a  network  can  guarantee 
recovery of a fixed fraction of the support with $m \ll n$ “network samples”.


Figure 1.3: Sparse signal sampling using a sensor network. (Diagram labels: $k$-sparse signal
$x$ at $n$ sensors; $n$-dimensional observations; noise $w$; $m$ samples.)


1.4  Overview  of  Related  Work 

There  is  a  rich  literature  on  sparse  signal  reconstruction  and  support  recovery.  In  this  section 
we  present  a  summary  of  some  of  the  relevant  research  with  an  emphasis  on  support  recovery. 

In the noiseless setting ($w = 0$) it has been shown that the support of any signal with
exactly $k$ non-zero elements can be recovered from just $m = k + 1$ samples [1]. However,
such estimation is a lattice-decoding problem. Compressive sensing [10, 5, 6] has shown that
for a small increase in the number of noiseless samples, $m = O(k \log(n/k))$, perfect recovery
can be achieved using an efficient algorithm (linear program) called Basis Pursuit.

In the presence of noise, perfect support recovery becomes more difficult. Compressive
sensing results [11, 12] show that for $m = O(k \log(n/k))$ samples there exist efficient
algorithms (quadratic programs) which can provide an estimate $\hat{x}$ that is stable, that is,
$\|\hat{x} - x\|$ is bounded with respect to $\|w\|$. Clearly such estimation procedures must supply some
information about the support, but how much? A partial answer is provided by [13, 14, 15, 11]
where it is shown that it is possible to obtain an estimate $\hat{x}$ whose support is contained
inside the support of $x$. However, no guarantees are given on the size of the estimated support.

Taking a slightly different approach, the work of [16, 17, 18] has addressed the ability
of a particular $\ell_1$-constrained quadratic program, the Lasso, to guarantee perfect support
recovery in the large system setting ($n \to \infty$). Results are formulated in terms of scaling
conditions for $(n, k, m)$ and the magnitude of the smallest non-zero component of $x$, denoted
$x_{\min}$. Most relevant to this thesis is the work of Wainwright [18] which shows that in
the linear sparsity regime, if the $\phi_i$ are i.i.d. $\mathcal{N}(0, \frac{1}{n} I)$ then both the sufficient and necessary
conditions for perfect recovery (using the Lasso) require that either $m/n \to \infty$ or $x_{\min} \to \infty$.

Another line of research has considered information-theoretic bounds on the performance
of the optimal support estimator, also in the large system setting $n \to \infty$. Gastpar and
Bresler [3] developed a lower bound on the probability of perfect support recovery in the
case where $\{\phi_i\}$ correspond to $m$ rows of the $n \times n$ Fourier matrix and the non-zero elements
of $x$ are i.i.d. Gaussian. For Gaussian $\Phi$, Sarvotham et al. [19] have given an arbitrary
rate-distortion lower bound, and Fletcher et al. [20, 21] have studied the rate-distortion behavior
of the signal $x$. Also, recent work by Aeron et al. [22] has studied the information-theoretic
properties of sensor networks for sparse signal recovery.

Finally,  a  major  source  of  inspiration  for  this  thesis  comes  from  Wainwright  [23]  who 
provided  both  lower  and  upper  bounds  on  the  number  of  samples  in  terms  of  the  scaling 
of $(n, k, m)$ and $x_{\min}$. In the linear sparsity setting, the conditions of the sufficient bound
are less stringent than in [18] but still require that either $m/n \to \infty$ or that $x_{\min} \to \infty$.
As Wainwright points out, there is a significant gap between this achievable bound and the
corresponding necessary bound which is finite, namely $m/n < 1$, for a fixed value of $x_{\min}$.


Chapter 2

Problem  Setup 


This  chapter  defines  three  components  of  our  sampling  setup.  Section  2.1  describes  how  the 
samples will be taken. Section 2.2 addresses what is known, or assumed to be known, about
the signal of interest prior to sampling and defines two particular signal classes. Section 2.3
addresses  the  information  of  interest  when  decoding  the  received  signal,  and  describes  two 
related  but  different  recovery  tasks. 

2.1  Specific  Sampling  Model 

For $x \in \mathbb{R}^n$ we consider a linear observation model (introduced in Section 1.2) in which
samples $y \in \mathbb{R}^m$ are taken as

$$y = \Phi x + w, \tag{2.1}$$

where $\Phi \in \mathbb{R}^{m \times n}$ is a sampling matrix and $w \in \mathbb{R}^m$ is noise. We assume that $x$ has
exactly $k$ non-zero elements which are indexed by the support $K$, and that $K$ is distributed
uniformly over the $\binom{n}{k}$ possibilities.

We further assume that the sampling matrix $\Phi \in \mathbb{R}^{m \times n}$ is randomly constructed with
i.i.d. rows $\phi_i \sim \mathcal{N}(0, \frac{1}{n} I)$. This matrix construction is a common choice in the compressed
sensing literature, and we focus on it for two reasons. First, it allows us to use powerful
asymptotic results for chi-squared random variables (Appendix A) and certain well-studied
random matrices (Appendix B) to develop our bounds.

The second reason for our choice of $\Phi$ is that its distribution is invariant to rotation.
That is, for any $n \times n$ orthonormal matrix $\Psi$, the matrix $\Phi \Psi$ is equal in distribution to
$\Phi$. The significance of this fact is that all our results also apply to the setting where the
observed signal is not sparse per se, but is sparse with respect to some known orthonormal
basis $\Psi$. Thus our sampling model extends to the more general setting

$$y = \Phi \Psi x + w. \tag{2.2}$$

This  sampling  model  is  crucial  to  many  applications  and  is  discussed  in  our  examples  of 
sensor  network  applications  in  Sections  1.3  and  5.1.  However,  for  the  rest  of  this  thesis  we 
find  it  convenient  to  describe  our  results  in  terms  of  sampling  model  (2.1). 

In our analysis of partial support recovery in Chapters 3 and 4 we assume that
$w \sim \mathcal{N}(0, \sigma_w^2 I)$. In our analysis of signal estimation in Chapter 5 we investigate what happens
if noise is added prior to sampling. Thus we compare two noise models: one in which
$w \sim \mathcal{N}(0, \sigma_w^2 I)$ and one in which $w \sim \mathcal{N}(0, \sigma_w^2 \Phi \Phi^T)$.

Our goal is to understand the asymptotic performance of the sampling model.
Hence, we assume that the size of the support $k = k_n$ and the number of samples $m = m_n$
depend on the ambient dimension $n$, and we see what happens as $n \to \infty$. As mentioned in
Section 1.4 there are many interesting choices for the scaling of $k_n$. We exclusively consider
the setting of linear sparsity where $k_n$ is a linear function of $n$. We are interested in which
sampling tasks can (and cannot) be solved in the under-sampled setting where $m_n < n$. We
provide the following definitions.

Definition 2.1. The sparsity is $\Omega_n = k_n/n$ and the asymptotic sparsity is $\Omega = \lim_{n \to \infty} k_n/n$.
This parameter is a measure of the “bandwidth” of a signal, and linear sparsity corresponds
to $0 < \Omega < 1$.

Definition 2.2. The sampling density is $\rho_n = m_n/n$ and the asymptotic sampling density
is $\rho = \lim_{n \to \infty} m_n/n$. In the under-sampled setting $\rho < 1$.

We find it convenient to consider a sampling matrix that preserves the magnitude of
$x$. Specifically, we choose to scale the variance of each element of the rows $\phi_i$ as $1/n$ so that
$E[\langle \phi_i, x \rangle^2] = \|x\|^2/n$, and we will consider signals whose average energy $\|x\|^2/n$ does not
depend on $n$. We caution the reader that these choices are in contrast to some of the related
work [18, 23] where $\Phi$ is chosen such that $E[\langle \phi_i, x \rangle^2] = \|x\|^2$ and hence $\Phi$ amplifies the signal
$x$.
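
The effect of this normalization can be checked numerically. The sketch below is an illustration rather than part of the analysis: it draws rows with i.i.d. $\mathcal{N}(0, 1/n)$ entries using only Python's standard library and compares the empirical average of the squared projections with $\|x\|^2/n$. The dimensions, the seed, and the helper names are arbitrary choices made for this example.

```python
import math
import random


def gaussian_row(n, rng):
    """One measurement vector with i.i.d. N(0, 1/n) entries."""
    std = 1.0 / math.sqrt(n)
    return [rng.gauss(0.0, std) for _ in range(n)]


def sample(Phi, x, sigma_w, rng):
    """Noisy linear projections y = Phi x + w with w ~ N(0, sigma_w^2 I)."""
    return [sum(p * xi for p, xi in zip(row, x)) + rng.gauss(0.0, sigma_w)
            for row in Phi]


rng = random.Random(0)          # fixed seed so the run is repeatable
n, k, m = 50, 10, 2000          # m is large here only to average over many rows

# k-sparse signal whose first k entries are non-zero.
x = [1.0] * k + [0.0] * (n - k)
energy_per_dim = sum(xi * xi for xi in x) / n          # ||x||^2 / n

Phi = [gaussian_row(n, rng) for _ in range(m)]
projections = sample(Phi, x, 0.0, rng)                 # noiseless: <phi_i, x>
empirical = sum(p * p for p in projections) / m        # average of <phi_i, x>^2

print(energy_per_dim, empirical)   # the two values should be close
```

Under the alternative scaling with unit-variance entries, the same experiment would give an average near $\|x\|^2$ instead, which is the amplification mentioned above.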


2.2  Sparse  Signal  Classes 


This  thesis  is  concerned  with  sampling  signals  that  are  sparse.  However,  to  make  guarantees 
on  what  can  or  cannot  be  achieved  requires  additional  information  about  the  class  of  the 
unknown signal. Let $\mathcal{X}$ denote a class of sparse signals and let $\mathcal{X}_n$ denote the sub-class of
signals with length $n$. One of the most useful measures we need is the signal-to-noise ratio.


Definition 2.3. For a given signal $x$, the per-sample signal-to-noise ratio (SNR) is

$$\mathrm{SNR}(x) = \frac{E[\|\Phi x\|^2]}{E[\|w\|^2]} = \frac{\|x\|^2}{n \sigma_w^2}. \tag{2.3}$$

Although the above definition also refers to the total signal-to-noise ratio, we call it the
per-sample signal-to-noise ratio to emphasize that it is a property of the average signal and
noise energies and does not depend on the number of samples.

Performance guarantees will require good bounds on $\mathrm{SNR}(x)$, and the types of bounds
we can use depend on the assumed signal class $\mathcal{X}$. In general we may want absolute bounds
which hold for all $x \in \mathcal{X}_n$. For stochastic signal classes, however, it may be advantageous (and
necessary) to consider bounds which hold with some desired probability. Although a variety
of  (potentially  weaker)  bounds  could  be  considered,  we  will  use  the  following  definition  in 
our  analysis. 

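Definition 2.4. For a given signal class $\mathcal{X}$, $\mathrm{SNR}(\mathcal{X})$ is an asymptotic lower bound on $\mathrm{SNR}(x)$.
If $\mathcal{X}$ is non-stochastic then the bound must satisfy

$$\mathrm{SNR}(\mathcal{X}) \le \mathrm{SNR}(x) \quad \text{for all } x \in \mathcal{X}. \tag{2.4}$$

If $\mathcal{X}$ is stochastic then there must exist some constant $c > 0$ such that

$$P\{\mathrm{SNR}(\mathcal{X}_n) \le \mathrm{SNR}(x)\} > 1 - e^{-nc}. \tag{2.5}$$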

Two  other  useful  measures  are  the  following. 

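Definition 2.5. For a given signal class $\mathcal{X}$, the relative size of the smallest non-zero element
of $x$ is characterized by $\beta_L(\mathcal{X})$. If $\mathcal{X}$ is non-stochastic then

$$\beta_L(\mathcal{X}) = \lim_{n \to \infty} \inf_{x \in \mathcal{X}_n} \inf_{i \in K} x_i^2 / \sigma_w^2. \tag{2.6}$$

If $\mathcal{X}$ is stochastic then $\beta_L(\mathcal{X})$ is a probabilistic upper bound and there must exist some
constant $c > 0$ such that

$$P\{\min_{i \in K} x_i^2 / \sigma_w^2 \le \beta_L(\mathcal{X}_n)\} > 1 - e^{-nc}. \tag{2.7}$$

Definition 2.6. For a given stochastic signal class $\mathcal{X}$ with $E[x_i] = 0$ and $E[x_i^2] = \sigma^2$ for $i \in K$, the
normalized per-sample signal-to-noise ratio is given by $\beta(\mathcal{X}) = \sigma^2/\sigma_w^2$ such that
$E[\mathrm{SNR}(x)] = \Omega \beta(\mathcal{X})$.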

In  general  a  wide  variety  of  signal  classes  may  be  considered.  In  the  following  sections 
we  introduce  two  example  classes,  one  stochastic  and  one  non-stochastic. 

2.2.1  Non-stochastic  Bounded  Signals 

Often  it  is  appealing  to  have  models  that  do  not  assume  a  distribution.  Such  models  may 
arise  naturally  when  we  need  a  worst-case  analysis.  Also,  resulting  claims  are  robust  in  that 
they  do  not  depend  on  the  choice  or  parameters  of  an  assumed  distribution.  Previous  work 
on  support  recovery  [16,  17,  18]  has  focused  on  the  following  class. 

Definition 2.7. $\mathcal{B}_n$ is the set of all $x \in \mathbb{R}^n$ whose non-zero elements are bounded from below
in magnitude, that is $|x_i| \ge x_{\min}$ for all $i \in K$ where $x_{\min}$ is a known constant that does not
depend on $n$.

For the class $\mathcal{B}$ it is clear that $\beta_L(\mathcal{B}) = x_{\min}^2/\sigma_w^2$, and $\mathrm{SNR}(\mathcal{B}) = \Omega \beta_L$.
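
To make these quantities concrete, the following sketch evaluates the per-sample SNR of Definition 2.3 for one bounded signal and checks it against the class bound given by the sparsity times $\beta_L$. The numerical values and function names are assumptions chosen for illustration.

```python
def snr(x, sigma_w2):
    """Per-sample SNR of Definition 2.3: ||x||^2 / (n * sigma_w^2)."""
    n = len(x)
    return sum(xi * xi for xi in x) / (n * sigma_w2)


def bounded_class_snr(k, n, x_min, sigma_w2):
    """Lower bound (k/n) * beta_L for the bounded class, beta_L = x_min^2 / sigma_w^2."""
    beta_L = x_min ** 2 / sigma_w2
    return (k / n) * beta_L


# Example: n = 8, k = 3, every non-zero magnitude is at least x_min = 1.
x = [1.0, 0.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0]
sigma_w2 = 0.5
k = sum(1 for xi in x if xi != 0)

signal_snr = snr(x, sigma_w2)                               # (1 + 4 + 2.25) / (8 * 0.5)
class_bound = bounded_class_snr(k, len(x), 1.0, sigma_w2)   # (3/8) * (1 / 0.5)
```

The bound is attained only when every non-zero element has magnitude exactly equal to the minimum; any larger element raises the signal's SNR above it.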

2.2.2 Gaussian Signals

In  this  thesis,  we  also  consider  the  class  of  Gaussian  signals  which  is  ubiquitous  in  information 
theory,  signal  processing,  and  communications. 

Definition 2.8. Let $\mathcal{G}_n$ be the set of all $x \in \mathbb{R}^n$ whose non-zero elements are Gaussian, that
is $x_i$ are i.i.d. $\mathcal{N}(0, \sigma^2)$ for all $i \in K$.

Since any “non-zero” element of $x$ can be arbitrarily close to zero, support recovery
becomes more difficult than when there is a fixed bound on $x_{\min}$. We note that $\mathrm{SNR}(x)$
is a random variable that obeys concentration of measure. Using standard large deviation
bounds for $\chi^2$ variables (Lemma A.1) we see that for any $\epsilon > 0$ we may choose
$\mathrm{SNR}(\mathcal{G}_n) = (1 - \epsilon/n) \Omega_n \beta(\mathcal{G}_n)$ where $\beta(\mathcal{G}_n) = \sigma^2/\sigma_w^2$. For simplicity we will use the limit
$\mathrm{SNR}(\mathcal{G}) = \mathrm{SNR}(\mathcal{G}_\infty) = \Omega \beta(\mathcal{G})$. Also, we may trivially choose $\beta_L(\mathcal{G}) = \beta(\mathcal{G})$ although much tighter
bounds are possible.


2.3  Recovery  Tasks 

Generally  stated,  the  goal  of  sampling  is  to  recover  some  function  of  the  unknown  signal 
to  within  some  desired  distortion.  In  the  following  sections  we  look  at  two  different  types 
of tasks, signal estimation and support recovery, and propose corresponding distortion
measures.


2.3.1  Support  Recovery 

Given the true support $K$ and any estimate $\hat{K}$ there are several natural measures for the
distortion $d(K, \hat{K})$. One may consider recovery of $K$ as a target recognition problem where
for each index $i \in \{1, \dots, n\}$ we want to determine whether or not $i$ is in the support $K$.
At one extreme, minimization of

$$d'(K, \hat{K}) = \begin{cases} |K| - |\hat{K}| & \hat{K} \subseteq K \\ \infty & \hat{K} \not\subseteq K \end{cases}$$

attempts to find the largest subset $\hat{K}$ that is contained in $K$. The results of [13, 14, 15, 11]
can be interpreted in terms of this metric. Roughly speaking, their results guarantee that
$d'(K, \hat{K}) \le |K|$ but cannot say much more because no guarantees are given on the size of $\hat{K}$.
At the other extreme, minimization of

$$d''(K, \hat{K}) = \begin{cases} \infty & \hat{K} \not\supseteq K \\ |\hat{K}| - |K| & \hat{K} \supseteq K \end{cases}$$

attempts to find the smallest subset $\hat{K}$ that contains the true support, and in general one
may formulate a Neyman-Pearson style tradeoff between the two types of errors.

Our focus is on reconstruction at the point where the number of false positives is equal
to the number of false negatives. Since we assume that $|K|$ is known a priori, we can impose
this condition by requiring that $|\hat{K}| = |K|$. Accordingly, we use the following metric, which
is half the total number of errors (false positives plus false negatives):

$$d(K, \hat{K}) = \begin{cases} |K| - |K \cap \hat{K}| & |\hat{K}| = |K| \\ \infty & |\hat{K}| \neq |K| \end{cases}$$

For the remainder of this thesis we consider only candidate supports $\hat{K}$ that are the same
size as the true support.
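
The three metrics above translate directly into set operations. The following Python sketch is one possible transcription; the function names and the example supports are hypothetical and chosen only for illustration.

```python
import math


def d_subset(K, K_hat):
    """d'(K, K_hat): |K| - |K_hat| if K_hat is contained in K, else infinity."""
    K, K_hat = set(K), set(K_hat)
    return len(K) - len(K_hat) if K_hat <= K else math.inf


def d_superset(K, K_hat):
    """d''(K, K_hat): |K_hat| - |K| if K_hat contains K, else infinity."""
    K, K_hat = set(K), set(K_hat)
    return len(K_hat) - len(K) if K_hat >= K else math.inf


def d(K, K_hat):
    """d(K, K_hat): |K| - |K & K_hat| if the sizes match, else infinity."""
    K, K_hat = set(K), set(K_hat)
    return len(K) - len(K & K_hat) if len(K) == len(K_hat) else math.inf


K = {1, 4, 7, 9}
K_hat = {1, 4, 5, 8}                  # two correct indices, two wrong ones

false_negatives = len(K - K_hat)      # indices of K that were missed
false_positives = len(K_hat - K)      # indices wrongly included
distortion = d(K, K_hat)
```

For equal-size supports the numbers of false positives and false negatives coincide, so their sum is exactly twice `distortion`.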


Figure  2.1:  Possible  distortion  metrics 


We can define partial recovery as the requirement that $d(K, \hat{K}) \le \alpha^*$ for some $\alpha^* \ge 0$. If
we consider $\alpha^*$ to be a function of $n$ then there are several interesting choices for the scaling
of $\alpha^*$. For instance, if recovery is possible with $\alpha^* = O(\log n)$ then as $n \to \infty$ the average
distortion $d(K, \hat{K})/k_n \to 0$ although the allowable number of errors $d(K, \hat{K}) \to \infty$. Our
results, however, pertain to linear scalings between $\alpha^*$ and $k_n$.

Definition 2.9. The fractional distortion is $\alpha \ge 0$, and fractional partial recovery is the
requirement that $d(K, \hat{K}) \le \alpha^*$ where $\alpha^* = \alpha k_n$.

The requirement $\alpha = 0$ corresponds to perfect recovery whereas the requirement $\alpha = 1$
is always satisfied (since we assume $|\hat{K}| = |K|$).

Our analysis will use the following definitions to characterize the performance of an
estimator $\hat{K}(y)$ with respect to fractional partial recovery. Recall that $y$ is a function of $x$
and thus the performance depends on $\mathcal{X}$.

Definition 2.10. Let $\hat{K}(y)$ be an estimator of $K$. If $\mathcal{X}_n$ is non-stochastic, then

$$P_e(\alpha, \mathcal{X}_n) = \max_{x \in \mathcal{X}_n} P\{ d(K, \hat{K}(y)) > \alpha k \}, \tag{2.8}$$

where the probability is over $w$ and $\Phi$. If $\mathcal{X}_n$ is stochastic, then

$$P_e(\alpha, \mathcal{X}_n) = P\{ d(K, \hat{K}(y)) > \alpha k \}, \tag{2.9}$$

where the probability is over $x$, $w$, and $\Phi$.

Definition 2.11. An estimator $\hat{K}(y)$ is said to be asymptotically reliable for a class $\mathcal{X}$ if
there exists some constant $c > 0$ such that $P_e(\alpha, \mathcal{X}_n) < e^{-nc}$.

Definition 2.12. An estimator $\hat{K}(y)$ is said to be asymptotically unreliable for a class $\mathcal{X}$ if
there exists some constant $c > 0$ and integer $N$ such that $P_e(\alpha, \mathcal{X}_n) > c$ for all $n > N$.

We remark that a weaker notion of reliable recovery is to constrain the expected distortion,
that is to require that $E[d(K, \hat{K}(y))] < \alpha k$. Although such a statement means that
on average the fractional distortion is less than $\alpha$, it is still possible that a linear fraction of
all possible supports have resulting distortion greater than $\alpha$. Our notion of asymptotically
reliable recovery implies more. It says that although there may be a set of “bad” supports
with resulting distortion greater than $\alpha$, the size of this set is very small relative to the total
number of possible supports.

For a baseline performance measure, we consider an estimator $\hat{K}_{rg}$ which randomly
guesses a subset of size $k$ independent of the data. It is clear that $E[d(K, \hat{K}_{rg})] = (1 - \Omega) k$
which corresponds to a fractional distortion of $\alpha = 1 - \Omega$. Moreover, this value is a sharp
threshold as is seen by the following lemma.

Lemma 2.1. The random guessing estimator $\hat{K}_{rg}$ is asymptotically reliable for any
$\alpha > 1 - \Omega$ and asymptotically unreliable for any $\alpha < 1 - \Omega$.

Proof. This follows from an extension of Hoeffding’s Inequality. This particular problem,
which corresponds to sampling without replacement, is addressed by Hoeffding [24]. □

The significance of Lemma 2.1 is that no samples are needed to guarantee any fractional
distortion $\alpha > 1 - \Omega$. Accordingly, this thesis focuses on support reconstruction for
$\alpha \in [0, 1 - \Omega]$.
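
The expected distortion of random guessing can be verified exactly for small problem sizes. The sketch below assumes that the guess is drawn uniformly over all subsets of size $k$, so that the overlap with the true support is hypergeometric; the function names and the values of `n` and `k` are illustrative choices.

```python
from fractions import Fraction
from math import comb


def overlap_pmf(n, k):
    """Exact distribution of |K & K_rg| when K_rg is a uniform size-k subset.

    The overlap is hypergeometric: P(j) = C(k, j) C(n-k, k-j) / C(n, k).
    """
    total = comb(n, k)
    return {j: Fraction(comb(k, j) * comb(n - k, k - j), total)
            for j in range(max(0, 2 * k - n), k + 1)}


def expected_distortion(n, k):
    """E[d(K, K_rg)] = k - E[overlap], computed exactly with rationals."""
    pmf = overlap_pmf(n, k)
    return k - sum(j * p for j, p in pmf.items())


n, k = 40, 10
omega = Fraction(k, n)
exact = expected_distortion(n, k)     # exact rational value
predicted = (1 - omega) * k           # (1 - Omega) k from the text
```

Exact rational arithmetic is used so that the comparison between `exact` and `predicted` is an equality rather than a floating-point approximation.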


2.3.2  Signal  Estimation 

A common distortion measure for signal estimation is the mean squared error (MSE). For
simplicity we consider the normalized mean squared error, $\|\hat{x} - x\|^2 / \|x\|^2$, and define the
following metric.

Definition 2.13. Let $\hat{x}(y)$ be an estimator of $x$. If $\mathcal{X}_n$ is non-stochastic, then

$$D(\mathcal{X}_n) = \max_{x \in \mathcal{X}_n} E\left[ \frac{\|\hat{x}(y) - x\|^2}{\|x\|^2} \right], \tag{2.10}$$

where the expectation is over $w$ and $\Phi$. If $\mathcal{X}_n$ is stochastic, then

$$D(\mathcal{X}_n) = E\left[ \frac{\|\hat{x}(y) - x\|^2}{\|x\|^2} \right], \tag{2.11}$$

where the expectation is over $x$, $w$, and $\Phi$.
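
For a single signal and a single estimate, the normalized squared error can be computed as in the following sketch. The function name and the example vectors are hypothetical; the distortion in Definition 2.13 additionally averages this quantity over the noise, the sampling matrix, and (for stochastic classes) the signal.

```python
def normalized_squared_error(x_hat, x):
    """Return ||x_hat - x||^2 / ||x||^2 for a single signal and estimate."""
    if len(x_hat) != len(x):
        raise ValueError("x_hat and x must have the same length")
    energy = sum(xi * xi for xi in x)
    if energy == 0:
        raise ValueError("the normalized error is undefined for x = 0")
    error = sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return error / energy


x = [0.0, 3.0, 0.0, 4.0]
x_hat = [0.0, 3.0, 1.0, 2.0]
value = normalized_squared_error(x_hat, x)   # (0 + 0 + 1 + 4) / (9 + 16)
```

With this normalization the all-zero estimate always has error 1, which gives a natural reference point for any estimator.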


Chapter 3

Support Recovery: Necessary Conditions


This  chapter  deals  with  fundamental  limits  on  support  recovery.  For  the  sampling  model 
described  in  Chapter  2,  we  give  necessary  conditions  on  the  number  and  quality  of  samples 
needed  to  recover  a  given  fraction  of  the  support.  In  Section  3.1.1  we  show  that  perfect 
recovery  is  not  possible  unless  the  SNR  increases  without  bound  with  dimension  n.  This 
means  that  as  n  becomes  large,  either  the  noise  must  disappear  or  the  magnitude  of  each 
non-zero  element  of  x  must  increase  without  bound.  In  Section  3.1.2  we  give  an  upper 
bound on the fraction of the support that can be recovered for a fixed SNR. Section 3.2 provides
discussion,  and  proofs  are  given  in  Section  3.3. 


3.1  Results 

Our  results  come  in  two  flavors.  The  bound  on  perfect  recovery  in  Section  3.1.1  is  a  sufficient 
condition  for  any  estimator  to  be  asymptotically  unreliable.  The  bounds  on  partial  recovery 
in Section 3.1.2 give necessary conditions for any estimator to be asymptotically reliable.

3.1.1  Perfect  Recovery 

In the paper [23], Wainwright gives sufficient and necessary conditions for perfect support
recovery. With respect to our sampling model, the sufficient conditions require $\beta_L(\mathcal{X}) = \infty$
whereas the necessary conditions are satisfied with $\beta_L(\mathcal{X}) < \infty$. In the following theorem we
show that $\beta_L(\mathcal{X}) < \infty$ is a sufficient condition for asymptotically unreliable recovery, and
thus $\beta_L(\mathcal{X}) = \infty$ is a necessary condition for asymptotically reliable recovery.

Theorem 3.1. For a given signal class $\mathcal{X}$, sparsity $\Omega \in (0, 1)$, and sampling rate $\rho < 1$,
consider the task of perfect support recovery, i.e. the fractional distortion $\alpha = 0$. If
$\beta_L(\mathcal{X}) < \infty$ then any estimator $\hat{K}(y)$ is asymptotically unreliable.

We remark that Theorem 3.1 is very general in that it depends only on the behavior
of the smallest non-zero element of $x$. This means that perfect recovery is not possible unless
the per-sample SNR grows without bound with $n$.

To see why this result makes sense we consider an observation model with $m = n$ and
$\Phi = I_n$ such that $y = x + w$. In this case, it is clear that there will always be some positive
probability of error. Our result shows that the same is true when we observe $y = \Phi x + w$
for $m < n$ and $\Phi$ Gaussian.
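
The identity-matrix case can be made concrete with a short numerical sketch. The code below is an illustration only, under assumptions that are not part of the sampling model: every non-zero entry of $x$ equals a fixed value `b`, the noise is i.i.d. Gaussian with standard deviation `sigma`, and the support is estimated by thresholding $y$ at `b/2` (a simple detector, not the optimal one). The function names are hypothetical.

```python
import math


def q_function(t):
    """Standard Gaussian tail probability P{N(0,1) > t}."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def threshold_error_probability(n, k, b, sigma):
    """P{at least one support error} when y = x + w is thresholded at b/2.

    Illustrative assumption: each of the k non-zero entries equals +b and
    the noise entries are independent N(0, sigma^2).
    """
    p_miss = q_function(b / (2.0 * sigma))   # a non-zero entry falls below b/2
    p_false = q_function(b / (2.0 * sigma))  # a zero entry rises above b/2
    p_all_correct = (1.0 - p_miss) ** k * (1.0 - p_false) ** (n - k)
    return 1.0 - p_all_correct


# Fixed per-entry SNR (b / sigma = 6), growing dimension n.
for n in (100, 10_000, 1_000_000):
    print(n, threshold_error_probability(n, n // 10, b=6.0, sigma=1.0))
```

For this detector the probability of at least one error grows toward one as $n$ increases with `b / sigma` held fixed, which is the qualitative behavior described above.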

3.1.2  Partial  Recovery 

The bounds in this section are information-theoretic in nature. The following definitions
allow us to characterize the number of bits needed to describe the support $K$ to within a
desired fractional distortion $\alpha$.

Definition 3.1. For $p \in [0, 1]$ the function $h(p) = -p \log(p) - (1 - p) \log(1 - p)$ is the binary
entropy function.

Definition 3.2. For $\Omega \in [0, 1]$ and $\alpha \in [0, 1 - \Omega]$ we define

$$h(\Omega, \alpha) = \Omega\, h(\alpha) + (1 - \Omega)\, h\!\left(\frac{\Omega \alpha}{1 - \Omega}\right). \qquad (3.1)$$

Let $\mathcal{K}_n(\alpha) = \{U : d(K, U) = \alpha\}$ be the set of supports with distortion equal to $\alpha$. Using
the fact [25] that $\log \binom{n}{k} = n h(k/n) + O(\log n)$ gives

$$\frac{1}{n} \log |\mathcal{K}_n(\alpha)| \to h(\Omega, \alpha) \quad \text{as } n \to \infty.$$

Hence, the asymptotic bit rate required to encode the support $K$ to within fractional distortion
$\alpha$ is given by $h(\Omega) - h(\Omega, \alpha)$. Note that at $\alpha = 1 - \Omega$ this quantity is zero. This
corresponds to Lemma 2.1 which shows that $1 - \Omega$ is a natural upper bound on $\alpha$.
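
The counting limit above can be checked numerically. The following sketch assumes that a support $U$ at distortion $\alpha$ misses exactly $a = \alpha k$ elements of $K$, so that the number of such supports is $\binom{k}{a}\binom{n-k}{a}$; this reading of $d(K, U)$ and the helper names `h`, `h2`, `log_binom` and `log_count_rate` are assumptions made for illustration. Natural logarithms are used throughout.

```python
import math


def h(p):
    """Binary entropy in nats, with h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def h2(omega, alpha):
    """h(Omega, alpha) = Omega h(alpha) + (1 - Omega) h(Omega alpha / (1 - Omega))."""
    return omega * h(alpha) + (1.0 - omega) * h(omega * alpha / (1.0 - omega))


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_count_rate(n, k, a):
    """(1/n) log of the number of size-k sets missing exactly a elements of K."""
    return (log_binom(k, a) + log_binom(n - k, a)) / n


omega, alpha = 0.1, 0.2
for n in (1_000, 100_000, 10_000_000):
    k = round(omega * n)
    a = round(alpha * k)
    print(n, log_count_rate(n, k, a), h2(omega, alpha))

# The required bit rate vanishes at the maximal distortion alpha = 1 - Omega.
print("rate at alpha = 1 - Omega:", h(omega) - h2(omega, 1.0 - omega))
```

As $n$ grows the normalized log-count approaches $h(\Omega, \alpha)$, with a gap of order $(\log n)/n$.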

In the following theorems we essentially plug our rate-distortion function (i.e. the necessary
bit rate $h(\Omega) - h(\Omega, \alpha)$) into known bounds. The first result applies to general signal
classes.

Theorem 3.2. For a given signal class $\mathcal{X}$, sparsity $\Omega \in (0,1)$, sampling rate $\rho < 1$, and
fractional distortion $\alpha \in (0, 1 - \Omega)$, a necessary condition for asymptotically reliable recovery
is

$$\rho > \frac{h(\Omega) - h(\Omega, \alpha)}{\frac{1}{2} \log(1 + \mathrm{SNR}(\mathcal{X}))}. \qquad (3.2)$$

This bound uses a straightforward application of the data processing inequality and has
been previously observed in [19] in terms of some general rate-distortion function. Because
the only dependence on the signal class $\mathcal{X}$ is in the SNR, this bound is general and works
for both stochastic and non-stochastic classes.
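
As a rough numerical illustration, the right-hand side of the bound in Theorem 3.2 can be evaluated directly. The sketch below is not taken from the source: the function name `necessary_density` and the chosen values of $\Omega$, $\alpha$ and SNR are illustrative, and natural logarithms are used in both numerator and denominator (the ratio does not depend on the base as long as it is the same in both).

```python
import math


def h(p):
    """Binary entropy in nats, with h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def h2(omega, alpha):
    """h(Omega, alpha) as in Definition 3.2."""
    return omega * h(alpha) + (1.0 - omega) * h(omega * alpha / (1.0 - omega))


def necessary_density(omega, alpha, snr):
    """Right-hand side of the Theorem 3.2 bound on the sampling rate rho."""
    return (h(omega) - h2(omega, alpha)) / (0.5 * math.log(1.0 + snr))


for omega in (0.05, 0.1, 0.2):
    row = [round(necessary_density(omega, alpha, 100.0), 4) for alpha in (0.0, 0.1, 0.5)]
    print(omega, row)
```

The bound decreases as the allowed distortion $\alpha$ grows and reaches zero at $\alpha = 1 - \Omega$.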

The drawback of Theorem 3.2 is that the bound is very loose. In the paper [3], Gastpar
and Bresler analyze a sampling model where the sampling matrix corresponds to $m$ rows of
the $n \times n$ Fourier matrix and develop a tighter bound for stochastic signals. This bound can
be extended to our sampling matrix and is given below.

Theorem 3.3. For a given stochastic signal class $\mathcal{X}$, sparsity $\Omega \in (0, 1)$, sampling rate
$\rho < 1$, and fractional distortion $\alpha \in (0, 1 - \Omega)$, a necessary condition for asymptotically
reliable recovery is

$$\rho > \frac{h(\Omega) - h(\Omega, \alpha) + \frac{1}{n} I(x; y | K)}{\frac{1}{2} \log(1 + \mathrm{SNR}(\mathcal{X}))} \qquad (3.3)$$

where $I(x; y | K)$ is the mutual information between $x$ and $y$ conditioned on $K$.

Notice that Theorem 3.2 follows immediately from Theorem 3.3 and the non-negativity
of mutual information. Hence, the sampling density bound in Theorem 3.3 is greater than
or equal to the bound in Theorem 3.2. However, the conditional mutual information is
difficult to compute in general. For the special case of the Gaussian signal class we derive a
closed-form expression of Theorem 3.3.


Theorem 3.4. For the Gaussian signal class $\mathcal{G}$, sparsity $\Omega \in (0,1)$, sampling rate $\rho < 1$,
and fractional distortion $\alpha \in (0, 1 - \Omega)$, a necessary condition for asymptotically reliable
recovery is

$$\rho > \frac{h(\Omega) - h(\Omega, \alpha) + \rho\, V_{WS}(\mathrm{SNR}(\mathcal{G}); \rho/\Omega)}{\frac{1}{2} \log(1 + \mathrm{SNR}(\mathcal{G}))} \qquad (3.4)$$

where the function $V_{WS}(\gamma, r)$ is given in Lemma B.3.


3.2  Discussion 

The bound in Theorem 3.2 is shown in Figure 3.1 as a function of $\alpha$ for $\mathrm{SNR}(\mathcal{X}) = 100$
and various $\Omega$. The strength of this bound is that it applies generally to all signal classes.
However, it is likely to be overly conservative for stochastic signal classes. For the Gaussian
signal class, the bound in Theorem 3.4 is shown in Figure 3.2 as a function of $\alpha$ for $\mathrm{SNR}(\mathcal{G}) = 10^5$
and various $\Omega$. Our intuition suggests that support recovery is more difficult for the
Gaussian class than for the bounded classes. This difference is supported by the results
where we see that the necessary sampling bound in Figure 3.2 is higher than in Figure 3.1
even though the SNR is much larger for the Gaussian bound.

In light of Theorem 3.1, which states that asymptotically reliable recovery cannot be
achieved with $\rho < 1$ and SNR $< \infty$, we see that both Theorems 3.2 and 3.4 are overly
conservative at $\alpha = 0$. What is not clear is how the true performance limit behaves as $\alpha$
becomes very small.

Figure 3.1: Necessary sampling density $\rho$ as a function of the fractional distortion $\alpha$ for
various $\Omega$ for any signal class.

Figure 3.2: Necessary sampling density $\rho$ as a function of the fractional distortion $\alpha$ for
various $\Omega$ for the class of Gaussian signals.

3.3 Proofs


3.3.1  Proof  of  Theorem  3.1 

In  this  proof  we  consider  a  modified  problem  in  which  the  estimator  has  access  to  additional 
information  about  x.  In  this  setting  we  analyze  the  optimal  estimator  K*  and  show  that, 
for  the  task  of  perfect  recovery,  K*  is  asymptotically  unreliable.  Since  K*  can  perform  no 
worse  than  any  estimator  in  the  unmodified  problem,  this  proves  our  desired  result. 

For a given signal $x$, let $i_0 = \arg\min_{i \in K} |x_i|$. Now, imagine that the decoder knows the
signal $x_K$ and the set $K_1 = K \setminus i_0$, that is, every element of the support except for $i_0$. Hence,
all that remains is to determine which of the remaining $n - k + 1$ indices belongs in $K$. Note
that $y - \Phi_{K_1} x_{K_1} \sim \mathcal{N}(x_{i_0} a_{i_0}, \sigma^2 I)$, where $a_j$ denotes column $j$ of $\Phi$, and thus the MAP
estimate of $i_0$ is given by

$$\hat{i}_0 = \arg\min_{j \in K_1^c} \|y - \Phi_{K_1} x_{K_1} - x_{i_0} a_j\|^2 = \arg\min_{j \in K_1^c} \|w + x_{i_0} a_{i_0} - x_{i_0} a_j\|^2 .$$

For this decoder, an error occurs if there exists $j \in K^c$ such that

$$\|w + x_{i_0} a_{i_0} - x_{i_0} a_j\| < \|w\| .$$

Let $\beta_L = x_{i_0}^2 / \sigma^2$. By squaring both sides of the above inequality and multiplying by $n / x_{i_0}^2$ we see that
the probability of error is equal to the probability of the following event

$$\min_{1 \le j \le n-k} \|Q + X_0 + X_j\|^2 < \|Q\|^2 \qquad (3.5)$$

where $Q \sim \mathcal{N}(0, (n/\beta_L) I_m)$ and $X_i \sim \mathcal{N}(0, I_m)$ independently for all $i$.

To bound this probability we define the following events

$$A_1 = \left\{\tfrac{1}{2} m < \|X_0\|^2 < 2m\right\} \cap \left\{\frac{nm}{2\beta_L} < \|Q\|^2 < \frac{2nm}{\beta_L}\right\},$$

$$A_2 = \{Q^T X_0 < 0\}.$$

Using the large deviations bound for central $\chi^2$ variables (Lemma A.1) we have $P\{A_1\} \ge 1 - e^{-c_1 n}$
for some $c_1 > 0$. Further we note that $P\{A_2 | A_1\} = 1/2$. Hence we can bound the probability of the event
$A = A_1 \cap A_2$ as

$$P\{A\} = P\{A_1\} P\{A_2 | A_1\} \ge (1/2)(1 - e^{-c_1 n}). \qquad (3.6)$$

Note that $\|Q + X_0\|^2 = \|Q\|^2 + \|X_0\|^2 + 2 Q^T X_0$ and thus the event $A$ places a bound on
the relative difference between $\|Q + X_0\|^2$ and $\|Q\|^2$.

Lemma A.3 says that the cumulative distribution function of a non-central $\chi^2_{NC}$ variable
is decreasing in the non-centrality parameter. Using this fact and conditioning on $A$ gives
the following bound for all $j = 1, \cdots, n - k$

$$P\{\|Q + X_0 + X_j\|^2 < \|Q\|^2 \mid A\} \ge \min_{t \in [1/(2\rho\beta_L),\, 2/(\rho\beta_L)]} P\{\chi^2_{NC}(m, 2m + t m^2) < t m^2\} = P\{\chi^2_{NC}(m, 2m + \gamma^2 m^2) < \gamma^2 m^2\}$$
where $\gamma^2$ corresponds to the minimizing value of $t$.

For a given signal class $\mathcal{X}$ we have $\beta_L \le \beta_L(\mathcal{X})$ where the bound is tight for non-stochastic
classes and holds with exponentially high probability in $n$ for stochastic signal classes. By
the assumptions $\beta_L(\mathcal{X}) < \infty$ and $\rho < 1$ we have $\gamma^2 > 0$. Also, since the probability of error
only increases as $\beta_L$ becomes small, it is sufficient to consider $\beta_L > 0$ for all $n$. In this case,
we have $\gamma^2 < \infty$.

Let $Z_j$ be i.i.d. $\chi^2_{NC}(m, 2m + \gamma^2 m^2)$ and let $f_Z(z)$ denote its probability density. We
now use Lemma A.4 which gives a lower bound on $f_Z(z)$. Specifically, given some finite
constant $\tau > 0$, there exists some constant $c > 0$ and integer $M < \infty$ such that

$$f_Z(z) \ge \frac{c}{m}$$

for all $m > M$ and $z \in [\gamma^2 m^2 - \tau m, \gamma^2 m^2]$. This means that

$$P\{Z_j \le \gamma^2 m^2\} = \int_{-\infty}^{\gamma^2 m^2} f_Z(z)\,dz \ge \int_{\gamma^2 m^2 - \tau m}^{\gamma^2 m^2} \frac{c}{m}\,dz = \tau c > 0 .$$

Hence, the total probability of error is bounded by

$$\begin{aligned}
P\left\{\min_{1 \le j \le n-k} \|Q + X_0 + X_j\|^2 < \|Q\|^2\right\}
&\ge P\{A\}\, P\left\{\min_{1 \le j \le n-k} \|Q + X_0 + X_j\|^2 < \|Q\|^2 \,\Big|\, A\right\} \\
&\ge P\{A\}\, P\left\{\min_{1 \le j \le n-k} Z_j < \gamma^2 m^2\right\} \\
&= P\{A\} \left[1 - P\{Z_j > \gamma^2 m^2\}^{n-k}\right] \\
&\ge P\{A\} \left[1 - (1 - P\{Z_j < \gamma^2 m^2\})^{n-k}\right] \\
&\ge P\{A\} \left[1 - (1 - \tau c)^{n-k}\right].
\end{aligned}$$

By (3.6) the right-hand side is bounded below by a quantity that tends to 1/2 as $n \to \infty$, and we see that the estimator is asymptotically unreliable.


3.3.2  Proof  of  Theorems  3.2,  3.3,  and  3.4 

We begin with the proof of Theorem 3.3 which implies Theorem 3.2. This proof is given in
the paper [3] for the Gaussian signal class, the task of perfect recovery, and the case where
$\Phi$ corresponds to $m$ rows of the $n \times n$ Fourier matrix. We give the proof here for a general
signal class $\mathcal{X}$, the task of partial recovery, and in terms of our Gaussian sampling matrix $\Phi$.

We define $s = x_K$ and note that the pair $(K, s)$ is equivalent to $x$. Also, we define
$z = \Phi x$ to be the noiseless version of the samples. Finally, we use standard definitions from
information theory [25].

The data processing inequality gives

$$I(z; y) \ge I(s, K; y),$$

and the chain rule for mutual information gives

$$I(s, K; y) = I(K; y) + I(s; y | K).$$

Thus we have

$$I(z; y) \ge I(K; y) + I(s; y | K).$$

Since the noise $w$ is i.i.d. Gaussian, and the signal-to-noise ratio between $z$ and $y = z + w$
is given by $\mathrm{SNR}(\mathcal{X})$, we see that an upper bound for the information $I(z; y)$ is given by the
capacity of $m$ uses of an additive white Gaussian noise channel. Thus,

$$I(z; y) \le \frac{m}{2} \log(1 + \mathrm{SNR}(\mathcal{X})).$$

Next, we consider how much information is required between $K$ and $y$. Given that $K$
is uniformly distributed over the $\binom{n}{k}$ possibilities, a simple counting argument and the fact
that $\log \binom{n}{k} = n h(k/n) + O(\log n)$ shows that the asymptotic number of bits we need to
decode $K$ to within accuracy $\alpha$ is given by $n h(\Omega) - n h(\Omega, \alpha)$. Using Fano's inequality, we
see that $P_e(\alpha) > 0$ unless

$$I(K; y) \ge n h(\Omega) - n h(\Omega, \alpha).$$

Putting everything together gives the necessary condition of Theorem 3.3

$$m \ge \frac{n h(\Omega) - n h(\Omega, \alpha) + I(s; y | K)}{\frac{1}{2} \log(1 + \mathrm{SNR}(\mathcal{X}))} .$$

Now, the simplified bound in Theorem 3.2 follows from the fact that $I(s; y | K) \ge 0$.
For stochastic signal classes, however, it may be possible to give a positive lower bound on
$I(s; y | K)$. For the Gaussian class $\mathcal{G}$, Gastpar [3] showed that

$$I(s; y | K = U) = \frac{1}{2} \log\det\left(I_k + \mathrm{SNR}(\mathcal{G})\, \Phi_U^T \Phi_U\right),$$

and thus

$$I(s; y | K) = E_K\, I(s; y | K = U) = \frac{1}{2} \log\det\left(I_k + \mathrm{SNR}(\mathcal{G})\, \Phi_K^T \Phi_K\right).$$

For a given sampling matrix, such as random rows of the Fourier matrix, this information
may be difficult to compute. However, for our sampling matrix $\Phi$ we can derive a closed
form solution using facts from Appendix B and the following lemma.

Lemma 3.1. For a nonnegative definite matrix $M \in \mathbb{R}^{n \times n}$ let $\lambda_1(M), \cdots, \lambda_n(M)$ denote the
eigenvalues of $M$ and let $F_M^n(x)$ denote the empirical eigenvalue distribution (B.1). For
$\gamma > 0$ we have

$$\frac{1}{n} \log\det(I + \gamma M) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \gamma \lambda_i(M)) = \int \log(1 + \gamma x)\, dF_M^n(x).$$

Proof. This follows from the properties of the determinant. □

As is shown in Appendix B, the matrix $\Phi_K^T \Phi_K$ is a Wishart matrix for all $K$. By the
Marcenko-Pastur law [26], the empirical distribution of the eigenvalues of this matrix
converges to a non-random continuous function as $n \to \infty$. This function is given in Lemma
B.1. This also means that $\frac{1}{n} I(s; y | K = U)$ converges to a value that does not depend on
$U$. This quantity is given by the so-called Shannon transform $V_{WS}(\gamma, r)$ in Lemma B.3. To
conclude we have,

$$\frac{1}{n} I(s; y | K = U) \to \rho\, V_{WS}(\mathrm{SNR}(\mathcal{G}); \rho/\Omega). \qquad (3.7)$$



Chapter  4 

Support  Recovery:  Sufficient 
Conditions 


This  chapter  deals  with  what  can  be  achieved  in  support  recovery.  For  the  sampling  model 
described  in  Chapter  2,  we  give  sufficient  conditions  on  the  number  and  quality  of  samples 
needed  to  recover  a  given  fraction  of  the  support.  These  results  are  complementary  to  the 
necessary  conditions  given  in  Chapter  3.  The  bounds  correspond  to  the  performance  of 
a  maximum  likelihood  (ML)  estimator  that  performs  an  exhaustive  search  over  all  possible 
supports.  This  estimator  is  described  in  Section  4.1.  The  results,  which  are  stated  in  Section 
4.2 and discussed in Section 4.3, show that the ML estimator can guarantee some fraction of
the  support.  If  this  fraction  is  not  too  close  to  one,  then  only  a  modest  sampling  rate  is 
sufficient.  However,  in  accordance  with  the  results  of  Chapter  3,  no  guarantees  can  be  made 
as the desired fraction becomes close to one. Proofs are given in Section 4.4.


4.1  ML  Support  Estimation 

We consider an ideal decoding algorithm which has knowledge of the sparsity parameter
$k$ and, given the samples $y$, performs an exhaustive search over all candidate supports $U$
with size $k$ to determine the most likely estimate of $K$. Further, our analysis focuses on
an estimator that is independent of the signal class $\mathcal{X}$ and assumes that all $k$-sparse signals
$x \in \mathbb{R}^n$ are equally likely.

Definition 4.1. The ML estimator $\hat{K}_{ML}(y)$ is given by

$$\hat{K}_{ML}(y) = \arg\max_{|U| = k}\ \max_{z \in \mathbb{R}^k} P\{y \mid K = U, x_K = z\}. \qquad (4.1)$$

This is the same estimator studied (for the special case of $\alpha = 0$) in Wainwright [23]
for the bounded signal class. The following result provides an equivalent form of the ML
estimate.


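Lemma 4.1. The ML estimator is given by

$$\hat{K}_{ML}(y) = \arg\min_{|U| = k}\ \min_{z \in \mathbb{R}^k} \|y - \Phi_U z\|^2 \qquad (4.2)$$

$$\phantom{\hat{K}_{ML}(y)} = \arg\min_{|U| = k} \left\|\left[I_m - \Phi_U (\Phi_U^T \Phi_U)^{-1} \Phi_U^T\right] y\right\|^2. \qquad (4.3)$$

Proof. This follows from our definition of the ML estimate and is shown in the paper [23]. □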

The ML estimator is not optimal because it ignores information about $\mathcal{X}$. However we
study it for several reasons. First, the ML decoder can be seen as a universal decoder that
is appropriate for cases in which we have incomplete or non-existent knowledge of $\mathcal{X}$. For
instance, the implementation of the ML decoder does not depend on the value of $\beta$ or $\beta_L$. Second,
for many signal classes, the difference between the optimal and ML estimates decreases as
the SNR becomes large. Finally, as is shown in Lemma 4.1, the ML decoder can be written
in terms of orthogonal projections of $y$, which is a formulation that simplifies our analysis.
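
The projection form of the ML estimate lends itself to a direct, if exponentially slow, implementation. The sketch below is a toy illustration rather than the estimator analyzed in this chapter: the names `residual_energy` and `ml_support` are hypothetical, the sampling matrix is stored as a list of columns with unit-variance Gaussian entries (the normalization of the sampling model is not reproduced), and the samples are noiseless so that the outcome is easy to check. The residual after projecting out the span of $\Phi_U$ is computed by Gram-Schmidt instead of the explicit inverse in (4.3).

```python
import itertools
import random


def residual_energy(columns, y):
    """Squared norm of y after projecting out the span of the given columns."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:  # modified Gram-Schmidt step
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-12:
            basis.append([a / norm for a in v])
    r = list(y)
    for q in basis:
        dot = sum(a * b for a, b in zip(q, r))
        r = [a - dot * b for a, b in zip(r, q)]
    return sum(a * a for a in r)


def ml_support(phi_columns, y, k):
    """Exhaustive search over all size-k supports (feasible only for tiny n)."""
    n = len(phi_columns)
    return min(
        itertools.combinations(range(n), k),
        key=lambda U: residual_energy([phi_columns[j] for j in U], y),
    )


rng = random.Random(0)
n, m, k = 8, 5, 2
phi_columns = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
x = [0.0] * n
x[1], x[6] = 1.5, -2.0  # true support K = {1, 6}
y = [sum(phi_columns[j][i] * x[j] for j in range(n)) for i in range(m)]  # noiseless samples
print(ml_support(phi_columns, y, k))
```

The search visits all $\binom{n}{k}$ candidate supports, which is why ML decoding is only of theoretical interest for problems of non-trivial size.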

We  remark  that  ML  decoding  is  computationally  hard  for  any  problem  of  non-trivial 
size.  However,  the  resulting  achievable  bound  is  interesting  because  it  allows  us  to  see  where 
there  is  potential  for  improvement  in  current  sub-optimal  recovery  algorithms.  Furthermore, 
if  one  is  able  to  lower  bound  the  performance  of  some  efficient  estimator  with  respect  to  the 
optimal  decoder,  then  an  achievable  result  is  automatically  attained. 

For  completeness  we  also  provide  the  maximum  a  posteriori  (MAP)  estimator  for  the 
Gaussian  signal  class. 

Lemma 4.2. For the signal class $\mathcal{G}$, the MAP estimator is given by

KMap{v ,  Q)  =  arg  min  min  ||y  -  $uz\\2  +  \\\z\\2  (4.4) 

\U\=k  zGRfe  p 

=  arg  min  ||[Jm  -  y\?-  (4.5) 

\U\=k  H 

Proof.  This  follows  from  the  definition  of  a  MAP  estimate  for  Gaussian  variables.  □ 

4.2  Results 

Intuitively, the components of $x_K$ with the smallest magnitudes are the most likely to be left
out of the estimated support. Our achievable bound relies on the total magnitude of all the
missed components of $x$. Accordingly we introduce the following term which is a function
of $x$.

Definition 4.2. Let $s \in \mathbb{R}^k$ correspond to the non-zero elements of $x$. Assume that $s$ is
indexed such that $|s_1| \le |s_2| \le \cdots \le |s_k|$. Then, for some $\alpha \le 1$, the normalized magnitude
of the smallest $a = \lceil \alpha k \rceil$ elements of $s$ is

$$g(\alpha, x) = \frac{1}{\alpha} \frac{\sum_{i=1}^{a} |s_i|^2}{\|x\|^2}. \qquad (4.6)$$




As we saw before with the SNR, we need a bound on $g(\alpha, x)$ that holds for a given signal
class $\mathcal{X}$.

Definition 4.3. For a given signal class $\mathcal{X}$, $g(\alpha, \mathcal{X})$ is an asymptotic lower bound on $g(\alpha, x)$.
If $\mathcal{X}$ is non-stochastic then the bound must satisfy

$$g(\alpha, \mathcal{X}) \le g(\alpha, x) \quad \text{for all } x \in \mathcal{X}. \qquad (4.7)$$

If $\mathcal{X}$ is stochastic then there must exist some constant $c > 0$ such that

$$P\{g(\alpha, \mathcal{X}_n) \le g(\alpha, x)\} \ge 1 - e^{-nc}. \qquad (4.8)$$


For the bounded class, it is clear that we may choose $g(\alpha, \mathcal{B}) = 1$. For the Gaussian
class, a suitable choice is provided by the following lemma which is proved in Section 4.4.3.

Lemma 4.3. For the Gaussian signal class $\mathcal{G}$ we may choose

$$g(\alpha, \mathcal{G}) = -W\!\left(-e^{-(2/\alpha) h(\alpha) - 1}\right) \ge e^{-(2/\alpha) h(\alpha) - 1} \ge \alpha^2 e^{-3} \qquad (4.9)$$

where the Lambert-W function $W(z)$ is the inverse function of $f(z) = z e^z$.
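
The Lambert-W expression in Lemma 4.3 can be evaluated without a special-function library. The sketch below is an illustration under stated assumptions: it uses natural logarithms, takes the principal branch of $W$, and relies on the fact that $g = -W(-e^{-c})$ with $c = (2/\alpha) h(\alpha) + 1$ is the root in $(0, 1)$ of $g - \log g = c$. The function name `g_gaussian` is hypothetical. The loop also checks numerically the two elementary lower bounds $e^{-(2/\alpha) h(\alpha) - 1}$ and $\alpha^2 e^{-3}$, the latter being our reading of the last term of (4.9).

```python
import math


def h(p):
    """Binary entropy in nats, with h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def g_gaussian(alpha, iters=200):
    """Solve g - log(g) = 1 + (2/alpha) h(alpha) for g in (0, 1) by bisection.

    The root equals -W(-exp(-(2/alpha) h(alpha) - 1)) on the principal branch.
    """
    target = 1.0 + (2.0 / alpha) * h(alpha)
    lo, hi = 1e-300, 1.0  # g - log(g) decreases from +inf to 1 on (0, 1]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid - math.log(mid) > target:
            lo = mid  # mid lies to the left of the root
        else:
            hi = mid
    return 0.5 * (lo + hi)


for alpha in (0.05, 0.2, 0.5):
    first_bound = math.exp(-(2.0 / alpha) * h(alpha) - 1.0)
    second_bound = alpha ** 2 * math.exp(-3.0)
    print(alpha, g_gaussian(alpha), first_bound, second_bound)
```

For small $\alpha$ all three quantities shrink roughly like $\alpha^2$, which is what drives the $\alpha^3$ dependence in the simplified Gaussian corollary.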

Remarkably, the bounds $\mathrm{SNR}(\mathcal{X})$ and $g(\alpha, \mathcal{X})$ tell us all we need to know about the
signal class $\mathcal{X}$ for our achievable bound. We now state our main theorem which gives a set
of sufficient conditions for asymptotically reliable recovery using the ML estimator.

Theorem 4.1. For a given signal class $\mathcal{X}$, sparsity $\Omega \in (0,1)$, sampling rate $\rho < 1$,
and fractional distortion $\alpha \in (0, 1 - \Omega)$, the estimator $\hat{K}_{ML}(y)$ is asymptotically reliable
if $\mathrm{SNR}(\mathcal{X}) > 1/(\alpha\, g(\alpha, \mathcal{X}))$ and

$$\rho > \Omega + \max_{u \in [\alpha, 1 - \Omega]} \frac{2 h(\Omega, u)}{\log\left(\mathrm{SNR}(\mathcal{X})\, u\, g(u, \mathcal{X})\right) + \left(\mathrm{SNR}(\mathcal{X})\, u\, g(u, \mathcal{X})\right)^{-1} - 1} \qquad (4.10)$$

where the function $g(u, \mathcal{X})$ satisfies Definition 4.3, and $h(\Omega, u)$ is given by (3.1).


In the following corollaries, we provide a simplified, and necessarily weaker, set of sufficient
conditions for the bounded and Gaussian signal classes. These corollaries make it
easier to see the approximate scaling behavior of the bounds.

Corollary 4.1. For the bounded signal class $\mathcal{B}$, sparsity $\Omega \in (0, 1)$, sampling rate $\rho < 1$,
and fractional distortion $\alpha \in (0, 1 - \Omega)$, the estimator $\hat{K}_{ML}(y)$ is asymptotically reliable if
$\mathrm{SNR}(\mathcal{B}) > e/\alpha$ and

$$\rho > \Omega + \frac{2 h(\Omega)}{\log\left(\mathrm{SNR}(\mathcal{B})\, \alpha / e\right)}. \qquad (4.11)$$
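
To see how much is lost in the simplification, the maximization in Theorem 4.1 can be approximated on a grid and compared with the closed form of Corollary 4.1. The sketch below is illustrative only: the function names, the grid resolution, and the parameter values are assumptions, the default `g` corresponds to the bounded class (for which $g(u, \mathcal{B}) = 1$), natural logarithms are used, and the effective SNR is assumed to exceed one on the whole grid.

```python
import math


def h(p):
    """Binary entropy in nats, with h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def h2(omega, alpha):
    """h(Omega, alpha) as in Definition 3.2."""
    return omega * h(alpha) + (1.0 - omega) * h(omega * alpha / (1.0 - omega))


def sufficient_density(omega, alpha, snr, g=lambda u: 1.0, grid=2000):
    """Grid approximation of the right-hand side of the Theorem 4.1 bound."""
    worst = 0.0
    for i in range(grid + 1):
        u = alpha + (1.0 - omega - alpha) * i / grid
        s = snr * u * g(u)                    # effective SNR of the missed components
        denom = math.log(s) + 1.0 / s - 1.0   # positive whenever s > 1
        worst = max(worst, 2.0 * h2(omega, u) / denom)
    return omega + worst


def corollary_density(omega, alpha, snr):
    """Right-hand side of the simplified bound for the bounded class."""
    return omega + 2.0 * h(omega) / math.log(snr * alpha / math.e)


omega, alpha, snr = 0.1, 0.1, 100.0
print("Theorem 4.1 (grid):", sufficient_density(omega, alpha, snr))
print("Corollary 4.1:     ", corollary_density(omega, alpha, snr))
```

For these values the grid evaluation of the theorem is noticeably smaller than the corollary, consistent with the corollary being the weaker statement.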

Corollary 4.2. For the Gaussian signal class $\mathcal{G}$, sparsity $\Omega \in (0,1)$, sampling rate $\rho < 1$,
and fractional distortion $\alpha \in (0, 1 - \Omega)$, the estimator $\hat{K}_{ML}(y)$ is asymptotically reliable if
$\mathrm{SNR}(\mathcal{G}) > e^4/\alpha^3$ and

$$\rho > \Omega + \frac{2 h(\Omega)}{\log\left(\mathrm{SNR}(\mathcal{G})\, \alpha^3 / e^4\right)}. \qquad (4.12)$$


In the following corollary we consider a specific relationship between the density and the
sparsity and show how the upper bound on the fractional distortion depends on the SNR.

Corollary 4.3 (Achievable Distortion). Let the sampling density be given by $\rho = \Omega + 2h(\Omega)$.
With exponentially high probability in $n$, the fractional distortion of the estimator $\hat{K}_{ML}(y)$
obeys

$$\alpha < e^2/\mathrm{SNR}(\mathcal{B}) \qquad (4.13)$$

for the bounded signal class and

$$\alpha < \left(e^2/\mathrm{SNR}(\mathcal{G})\right)^{1/3} \qquad (4.14)$$

for the Gaussian class.
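
For the bounded class, the step from Corollary 4.1 to (4.13) can be written out explicitly. The derivation below is a sketch that assumes natural logarithms and uses only the condition (4.11) with the sampling density fixed at $\rho = \Omega + 2h(\Omega)$.

```latex
\begin{align*}
\Omega + 2h(\Omega) > \Omega + \frac{2h(\Omega)}{\log\left(\mathrm{SNR}(\mathcal{B})\,\alpha/e\right)}
  &\iff \log\left(\mathrm{SNR}(\mathcal{B})\,\alpha/e\right) > 1 \\
  &\iff \alpha > e^{2}/\mathrm{SNR}(\mathcal{B}).
\end{align*}
```

Any such $\alpha$ also satisfies the side condition $\mathrm{SNR}(\mathcal{B}) > e/\alpha$, so every distortion level above $e^2/\mathrm{SNR}(\mathcal{B})$ is achievable at this density. For example, $\mathrm{SNR}(\mathcal{B}) = 100$ gives a threshold of $e^2/100 \approx 0.074$.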

4.3  Discussion 

In this section, we discuss the implications of Theorem 4.1 for the bounded and Gaussian
signal classes. Figures 4.1 and 4.2 (log scale) show the sufficient sampling bound as a function
of $\alpha$ for various $\Omega$ for the bounded signal class with SNR = 100. Figures 4.3 and 4.4 (log
scale) show the sufficient sampling bound as a function of $\alpha$ for various $\Omega$ for the Gaussian
signal class with SNR = $10^5$. Also shown are the corresponding lower bounds from Chapter
3.

For both signal classes, we see that recovery in the under-sampled setting with fixed
SNR is possible over a range of $\alpha$. However, as $\alpha$ becomes small, the sampling rate increases
without bound. This confirms our results from Chapter 3 which indicate that perfect recovery
in the presence of noise is very challenging. At the same time, it shows that if we accept a
small fraction of errors, reasonable results can be attained. Also, from Figures 4.2 and 4.4
we see that the upper and lower bounds are reasonably tight for values of $\alpha$ that are not near
0 or $1 - \Omega$.

We may also consider non-asymptotic results. By paying attention to the exact error
exponents we can give a bound on the probability of error for fixed $n$. In Table 4.1 we show
the sufficient conditions to recover 99.9% of the support when $n = 10{,}000$ and $x_K$ is bounded
from below. For example, when $k = 200$ and $m = 522$ the probability that the support is
not perfectly recovered is less than 0.0001. In Table 4.2, we show sufficient conditions to
recover 90% of the support when $n = 10{,}000$ and $x_K$ is Gaussian. In this setting we see that
a high SNR is required, but that reliable recovery can be performed with very few samples.

Figure 4.1: Sufficient (bold) and necessary (light) sampling densities $\rho$ as a function of the
fractional distortion $\alpha$ for various $\Omega$ for the class of bounded signals.

Figure 4.2: Sufficient (bold) and necessary (light) sampling densities $\rho$ (log scale) as a
function of the fractional distortion $\alpha$ for various $\Omega$ for the class of bounded signals.

Figure 4.3: Sufficient (bold) and necessary (light) sampling densities $\rho$ as a function of the
fractional distortion $\alpha$ for various $\Omega$ for the class of Gaussian signals.

Figure 4.4: Sufficient (bold) and necessary (light) sampling densities $\rho$ (log scale) as a
function of the fractional distortion $\alpha$ for various $\Omega$ for the class of Gaussian signals.

Table 4.1: Sufficient number of samples to recover 99.9% of the support when $n = 10{,}000$
and $x$ is bounded.

Table 4.2: Sufficient number of samples to recover 90% of the support when $n = 10{,}000$ and
$x$ is Gaussian.


4.4  Proofs 


4.4.1  Proof  of  Theorem  4.1 


The main technical result underlying the proof is the following lemma which relates the
desired error probability, $P_e(\alpha)$, to the large deviations behavior of multiple independent
chi-squared variables. The proof of this lemma is given in Section 4.4.2.

Lemma 4.4. For given parameters $(n, k, m, \alpha)$ and signal class $\mathcal{X}_n$, let $g(u, \mathcal{X}_n)$ satisfy
Definition 4.3 with error exponent $c_0 > 0$ for all $u \in [\alpha, 1 - k/n]$. Then, for any scalar $t > 0$

$$P_e(\alpha) \le P\{\chi^2(m - k) > t\} + \sum_{a = \lceil \alpha k \rceil}^{\lfloor k - k^2/n \rfloor} \left[ \binom{k}{a} \binom{n-k}{a} P\{\chi^2(m - k) < r(a)\, t\} + e^{-n c_0} \right] \qquad (4.15)$$

where $r(a) = \left[\mathrm{SNR}(\mathcal{X})\, (a/k)\, g(a/k, \mathcal{X})\right]^{-1}$.

We proceed by using large deviation bounds and the union bound to lower bound the
error exponents of the terms on the right hand side of (4.15). We then determine conditions
such that the exponents are positive.

First, for any $\nu > 0$ we may choose $t_\nu = (1 + \nu)(m - k)$. By the upper concentration
bound in Lemma A.1 we have

$$\frac{1}{n} \log\left(P\{\chi^2(m - k) > t_\nu\}\right) \le -E_1$$

where $E_1 = (\rho - \Omega)\nu^2/4$ is positive for all $\rho > \Omega$.

Next, we consider the remaining term in (4.15) evaluated with $t_\nu$ for $\nu$ arbitrarily close
to zero. To use the lower concentration bound in Lemma A.1 requires

$$\mathrm{SNR}(\mathcal{X}) > \max_{u \in [\alpha, 1 - \Omega]} \frac{1}{u\, g(u, \mathcal{X})} = \frac{1}{\alpha\, g(\alpha, \mathcal{X})}.$$

If this inequality is satisfied, then

$$\frac{1}{n} \log\left(P\{\chi^2(m - k) < r(a)\, t\}\right) \le -E_2(a)$$

where

$$E_2(a) = \frac{\rho - \Omega}{2} \left[ \log\left(\mathrm{SNR}(\mathcal{X})(a/k)\, g(a/k, \mathcal{X})\right) + \left(\mathrm{SNR}(\mathcal{X})(a/k)\, g(a/k, \mathcal{X})\right)^{-1} - 1 \right].$$

Finally, we note that $\frac{1}{n} \log \left[\binom{k}{a} \binom{n-k}{a}\right] \to h(\Omega, a/k)$ as $n \to \infty$, where $h(\Omega, \alpha)$ is given by (3.1).
Putting everything together gives

$$\begin{aligned}
P_e(\alpha) &\le e^{-n E_1} + \sum_{a = \lceil \alpha k \rceil}^{\lfloor k - k^2/n \rfloor} \left[ e^{-n (E_2(a) - h(\Omega, a/k))} + e^{-n c_0} \right] \\
&\le e^{-n E_1} + \max_{\alpha k \le a \le (1 - \Omega) k} e^{-n (E_2(a) - h(\Omega, a/k)) + \log k} + e^{-n c_0 + \log k}
\end{aligned}$$

and the bound in Theorem 4.1 guarantees that $E_2(a) - h(\Omega, a/k) > 0$ for $\alpha k \le a \le (1 - \Omega) k$.


4.4.2 Proof of Lemma 4.4

For a given set $(n, k, m, \alpha, \mathrm{SNR}(\mathcal{X}_n), g(u, \mathcal{X}_n))$ we derive an upper bound on $P_e(\alpha)$ for the ML
decoder. In particular, we analyze $P_e(\alpha | K)$, the error conditioned on the true set $K$. Because
of the symmetry in the sampling procedure, we will see that the conditional probability does
not depend on the particular set $K$. Therefore for any distribution over $K$ we have

$$P_e(\alpha) = \sum_{K : |K| = k} P(K)\, P_e(\alpha | K) = P_e(\alpha | K).$$

Analysis of ML Decoder

As is shown in [23], the ML decoder in Lemma 4.1 can be written explicitly as a single
minimization. For any subset $U$ with $|U| = k < m$ the $m \times k$ matrix $\Phi_U$ has full rank with
probability one. Thus we may define the following orthogonal projections

$$\Pi_U = \Phi_U [\Phi_U^T \Phi_U]^{-1} \Phi_U^T \qquad (4.16)$$

$$\Pi_U^\perp = I - \Phi_U [\Phi_U^T \Phi_U]^{-1} \Phi_U^T \qquad (4.17)$$

corresponding to the range space of $\Phi_U$ and the null space of $\Phi_U^T$, respectively. The ML decoder can now be
given as

$$\hat{K}_{ML} = \arg\min_{|U| = k} \mathrm{err}(U) \qquad (4.18)$$

where $\mathrm{err}(U) = \frac{1}{\sigma_w^2} \|\Pi_U^\perp y\|^2$.

Analysis of Error

Let $a^* = \lceil \alpha k \rceil$ and consider the sets

$$G = \{U : |U| = k,\ |U \cap K| > k - a^*\}$$

$$B = \{U : |U| = k,\ |U \cap K| \le k - a^*\}$$

where $G$ represents the “good” set of candidate supports and $B$ represents the “bad” set.
An error is declared if $\hat{K} \in B$ and this occurs if and only if

$$\min_{U \in B} \mathrm{err}(U) < \min_{V \in G} \mathrm{err}(V).$$

To develop a bound on the above event, we split it into two sub-events which can be analyzed
independently. For any scalar $t$ we define the following two “bad” events

$$A_B = \left\{ \min_{U \in B} \mathrm{err}(U) < t \right\}, \qquad A_G = \left\{ \min_{V \in G} \mathrm{err}(V) > t \right\}.$$

Note that $A_B^c \cap A_G^c$ is a sufficient condition for success. Accordingly, the probability of error
can be bounded as

$$P_e(\alpha | K) \le 1 - P(A_B^c \cap A_G^c) = P(A_B \cup A_G) \le P(A_B) + P(A_G).$$

Our proof proceeds by relating $P(A_G)$ and $P(A_B)$ to the cumulative distribution functions
of various $\chi^2$ variables.

Bounding $P(A_G)$

To tightly characterize the behavior of $\min_{V \in G} \mathrm{err}(V)$ would require significant effort. However,
since it is always true that $K \in G$, it is sufficient to consider the weakened bound
$P(A_G) \le P(\mathrm{err}(K) > t)$. Since $w$ has zero mean i.i.d. Gaussian elements, and $\Pi_K^\perp$ is an
orthogonal projection matrix, the random variable

$$\mathrm{err}(K) = \frac{1}{\sigma_w^2} \|\Pi_K^\perp y\|^2 = \frac{1}{\sigma_w^2} \|\Pi_K^\perp w\|^2$$

has a $\chi^2$ distribution with $m - k$ degrees of freedom (Lemma A.2).

Bounding $P(A_B)$

To bound the probability that $\min_{U \in B} \mathrm{err}(U) < t$, it is sufficient to bound $\mathrm{err}(U)$ for every
possible $U \in B$. At a high level, our approach to this task is to repeatedly partition $A_B$ into
smaller sub-events whose behavior we can characterize with respect to $\chi^2$ variables. Then,
repeated application of the union bound leads to a bound on our desired quantity $P(A_B)$.

We begin by partitioning $U \in B$ based on the size of the overlap $k - a = |U \cap K|$. Let

$$B(a) = \{U : |U| = k,\ |U \cap K| = k - a\},$$

$$A_{B(a)} = \left\{ \min_{U \in B(a)} \mathrm{err}(U) < t \right\}.$$

Then, the union bound gives

$$P(A_B) \le \sum_{a = a^*}^{a_{\max}} P(A_{B(a)}) \qquad (4.19)$$

where $a_{\max} = \lfloor k - k^2/n \rfloor$. Accordingly, the next step is to bound $P(A_{B(a)})$ for all
$a^* \le a \le a_{\max}$.
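
The size of each class $B(a)$, which enters the union bound as the factor $\binom{k}{a}\binom{n-k}{a}$, can be confirmed by brute-force enumeration for small parameters. In the sketch below the function name `overlap_counts` and the values of $n$ and $k$ are illustrative; by symmetry the true support is taken to be the first $k$ indices.

```python
import itertools
import math


def overlap_counts(n, k):
    """Count size-k subsets U of {0, ..., n-1} by a = k - |U ∩ K|, with K = {0, ..., k-1}."""
    K = set(range(k))
    counts = {}
    for U in itertools.combinations(range(n), k):
        a = k - len(K.intersection(U))  # number of elements of K missing from U
        counts[a] = counts.get(a, 0) + 1
    return counts


n, k = 10, 3
counts = overlap_counts(n, k)
for a in sorted(counts):
    # enumerated size of B(a) next to the closed form C(k, a) * C(n - k, a)
    print(a, counts[a], math.comb(k, a) * math.comb(n - k, a))
```

The two columns agree for every $a$, and the counts sum to $\binom{n}{k}$.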

Bounding $P(A_{B(a)})$

For any $U \in B(a)$ the distribution of $\mathrm{err}(U)$ can be characterized with respect to two
quantities. First, conditioned on the set $K \setminus U$, the magnitude of the “missed” components
of $x$ is given by $\mathrm{SNR}(x_{K \setminus U})$. Second, for any set $U$ the magnitude of the projected noise is
given by

$$\Lambda(U) = \frac{1}{\sigma_w^2} \|\Pi_U^\perp w\|^2.$$

Since $w$ is a Gaussian vector with i.i.d. elements and $\Pi_U^\perp$ is an orthogonal projection
matrix, $\Lambda(U)$ is a $\chi^2$ random variable with $m - k$ degrees of freedom (Lemma A.2).
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Now,  conditioned  on  SNR(xk\c/)  =  9  the  random  vector  (a2#)-1/2  $k\uXk\u  has  i.i.d. 
zero  mean  Gaussian  elements  with  variance  one.  If  we  also  condition  on  A (U)  =  A,  then  we 
see  that 


\err(U)  =  -^\  |n^  ($k\uXk\u  +  w)  ||2 

is  a  non-central  Xnc  variable  with  non-centrality  parameter  X/9  and  m—k  degrees  of  freedom 
due  to  the  orthogonal  projection  II ^  (Lemma  A. 2).  This  means  that 

P{err([Z)  <  t|SNR(xx\[/)  =  9,  A (U)  =  A}  =  P {xNc(m  ~  k,  X/9)  <  t/9 }  (4.20) 

To  reduce  amount  of  conditioning  required  in  (4.20)  we  use  Lemma  A. 3  which  states 
the  the  cumulative  distribution  function  of  non-central  y2  variable  is  decreasing  the  non¬ 
centrality  parameter.  Noting  that  A (U)  >  0  we  have 

P{err(G)  <  t\sm(xK\u)  =  9}  <  F{x2NC(m  ~  k,0)  <  t/9 } 

=  P{x2(m  —  k)  <  t/9 }, 


and  so 


P{err(I7)  <  i|SNR(x^\C/)  >  9}  <  P{y2(m  —  k)  <  t/9}. 
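By our assumptions we have

$$P\Big\{ \min_{U \in B(a)} \mathrm{SNR}(x_{K \setminus U}) < 1/r(a) \Big\} \le e^{-n c_0},$$

where $r(a) = [\mathrm{SNR}(X)\,(a/k)\,g(a/k, X)]^{-1}$. Thus, the union bound gives

$$P\{A_{B(a)}\} \le e^{-n c_0} + \sum_{U \in B(a)} P\{\chi^2(m - k) < r(a)\,t\} = e^{-n c_0} + \binom{k}{a}\binom{n - k}{a} P\{\chi^2(m - k) < r(a)\,t\}.$$
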


4.4.3  Proof  of  Lemma  4.3 

We are given $x_K \sim \mathcal{N}(0, \sigma_x^2 I_k)$ and $a_n = \lceil \alpha k_n \rceil$.

\U\  =  a},  then  by  definition  we  have 

g(a,x)  =  min  - 

UeC(a)  \\Xk\  I2 

We note that the variable $\|x_K\|^2/\sigma_x^2$ is $\chi^2$ with $k$ degrees of freedom and, for each set $U$, the variable $\|x_U\|^2/\sigma_x^2$ is $\chi^2$ with $a$ degrees of freedom. Although these variables are not


If  we  let  a  =  \akn]  and  C(a)  =  {U  C  K  : 
ka2x  1 1^;/||2 
independent, we can use the union bound to bound $g(\alpha, x)$ in terms of independent events. For any $\epsilon_1, \epsilon_2 > 0$ we have


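$$P\{g(\alpha, x) < (1 - \epsilon_1)(1 - \epsilon_2)\} \le P\{\|x_K\|^2/\sigma_x^2 < (1 - \epsilon_1)k\} + \sum_{U \in C(a)} P\{\|x_U\|^2/\sigma_x^2 < (1 - \epsilon_2)a\}$$

$$= P\{\chi^2(k) < (1 - \epsilon_1)k\} + |C(a)|\, P\{\chi^2(a) < (1 - \epsilon_2)a\}.$$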

Using concentration bounds for central $\chi^2$ variables (Lemma A.1) we may set $\epsilon_1$ arbitrarily close to zero. Furthermore, we note that $|C(a)| = \binom{k}{a}$. As $n \to \infty$ this means that

“log  (|(7(a)|P{y2(a)  <  (l-e2)a})  <E0(Q,e,p) 

where 

E0(Q,  e,  p )  =  Uai  [-  log(l  -  e2)  +  e2]  -  £lh(a). 

Solving for the critical value of $\epsilon_2$ such that $\lim_{n \to \infty} E_0(\Omega, \epsilon, \rho) > 0$ leads to the stated bound.
Chapter 5

Signal Estimation: Effects of Noise Prior to Sampling


While Chapters 3 and 4 were concerned with support recovery, the focus of this chapter is signal recovery. Our goal is to understand how the distortion, measured by the mean squared error (MSE), is affected by where the noise enters into the sampling process. We compare two noise models: the standard model in which the noise is added to each sample, and a variation in which the noise is added to the signal $x$ prior to sampling. For stochastic signal classes we derive closed-form expressions for the distortion of the linear minimum mean squared error (LMMSE) estimator that has a priori knowledge of the support. Section 5.1 describes the noise models, Section 5.2 describes the LMMSE estimator, Section 5.3 gives our results, Section 5.4 provides discussion, and proofs are given in Section 5.5.


5.1  Observation  Error  versus  Sampling  Error 

In  this  chapter  we  make  a  distinction  between  noise  that  is  added  to  each  sample  (sampling 
error) and noise that is added to the signal prior to sampling (observation error). Accordingly, we generalize the sampling model proposed in Chapter 2 to the following

$$y = \Phi(x + w_{obs}) + w_{smp} \tag{5.1}$$

where $w_{obs}$ is observation error, $w_{smp}$ is sampling error, and all other assumptions about $\Phi$ and $K$ are the same.
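The generalized model (5.1) can be simulated directly. The following is a minimal sketch, assuming Gaussian noise and a sampling matrix with i.i.d. $\mathcal{N}(0, 1/n)$ entries (a normalization chosen here for illustration). The helper names `sample_matrix`, `sparse_signal` and `observe` are hypothetical and are not taken from the thesis.

```python
import random

def sample_matrix(m, n, rng):
    """m x n matrix with i.i.d. N(0, 1/n) entries (assumed normalization)."""
    s = (1.0 / n) ** 0.5
    return [[rng.gauss(0.0, s) for _ in range(n)] for _ in range(m)]

def matvec(A, v):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def sparse_signal(n, k, sigma_x, rng):
    """n-vector with k non-zero N(0, sigma_x^2) entries on a random support K."""
    support = sorted(rng.sample(range(n), k))
    x = [0.0] * n
    for i in support:
        x[i] = rng.gauss(0.0, sigma_x)
    return x, support

def observe(Phi, x, sigma_obs, sigma_smp, rng):
    """y = Phi (x + w_obs) + w_smp, with Gaussian w_obs and w_smp."""
    w_obs = [rng.gauss(0.0, sigma_obs) for _ in x]
    noisy_x = [xi + wi for xi, wi in zip(x, w_obs)]
    y = matvec(Phi, noisy_x)
    return [yi + rng.gauss(0.0, sigma_smp) for yi in y]

rng = random.Random(0)
n, m, k = 50, 20, 5
Phi = sample_matrix(m, n, rng)
x, K = sparse_signal(n, k, 1.0, rng)

y_clean = observe(Phi, x, 0.0, 0.0, rng)  # no noise at all
y_obs = observe(Phi, x, 0.1, 0.0, rng)    # observation error only
y_smp = observe(Phi, x, 0.0, 0.1, rng)    # sampling error only
```

Setting one of the two standard deviations to zero isolates a single error type, which is the comparison carried out in the rest of this chapter.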

To  motivate  these  sources  of  noise,  we  consider  the  sensor  network  application  introduced 
in  Section  1.3.  Figure  5.1  below  shows  the  same  sensor  network  with  the  source  of  observation 
noise  and  sampling  noise  made  explicit.  In  this  context,  the  observation  noise  arises  from 
the  noise  and  quantization  of  the  sensor  reading  itself.  The  sampling  noise  arises  from  the 
computation  and  communication  over  the  network,  and  any  subsequent  quantization. 

Figure 5.1: Sparse signal sampling using a sensor network with observation and sampling error.

To understand the relative tradeoff of the two error types we consider samples taken with only one kind of error, that is

$$y_{obs} = \Phi(x + w_{obs}) \tag{5.2}$$

$$y_{smp} = \Phi x + w_{smp} \tag{5.3}$$

where both $w_{obs} \in \mathbb{R}^n$ and $w_{smp} \in \mathbb{R}^m$ have i.i.d. elements with zero mean and variance $\sigma_w^2$. Note that elements of the random vector $\Phi w_{obs}$ also have variance $\sigma_w^2$, but are not independent. We consider any stochastic signal class $X$ in which the non-zero elements are i.i.d. with zero mean and variance $\sigma_x^2$. Thus throughout this chapter $\beta = \beta(X) = \sigma_x^2/\sigma_w^2$. Finally, we do not constrain our analysis to the under-sampled setting and consider any $\rho > 0$.

5.2  Optimal  Linear  Estimator 

In  this  section  we  consider  estimation  of  the  signal  x  when  the  support  K  is  known.  This  is 
interesting  when  K  can  be  accurately  determined  or  when  K  is  known  a  priori  but  there  is 
no  control  over  how  the  samples  are  taken.  Additionally,  the  analysis  of  this  case  shows  the 
fundamental  differences  between  the  types  of  noise  regardless  of  uncertainty  in  the  support. 
Finally, we note that for the Gaussian class $\mathcal{G}$ the LMMSE estimator is the optimal MSE estimator. For results concerning a sub-optimal but efficient $\ell_1$-constrained estimator without knowledge of the true support see [27].

Conditioned on a particular sampling matrix $\Phi$ and support $K$, let $\Sigma_x$, $\Sigma_y$, and $\Sigma_{xy}$ denote the covariance and cross-correlation matrices of $x$ and $y$. It is well known that the LMMSE estimator is given by

$$\hat{x}(y|K) = \Sigma_{xy} \Sigma_y^{-1} y. \tag{5.4}$$

Moreover, the conditional distortion is given by

$$D(X|K, \Phi) = \frac{1}{k \sigma_x^2} E\big[\|x - \hat{x}(y|K)\|^2\big] = \frac{1}{k \sigma_x^2} \mathrm{tr}\{\Sigma_x - \Sigma_{xy} \Sigma_y^{-1} \Sigma_{yx}\}. \tag{5.5}$$

Hence, for a given sampling problem the distortion is given by

$$D(X) = E[D(X|K, \Phi)]. \tag{5.6}$$


5.3  Results 

We give a closed form expression for the MSE distortion of the LMMSE estimator for both sampling error and observation error. Our results are formulated in terms of the $\eta$-transform of the asymptotic spectra of the Wishart and F matrices (see Appendix B for more information).

Theorem 5.1. Let $X$ be a stochastic signal class whose non-zero elements are i.i.d. with zero mean and variance $\sigma_x^2$. For sparsity $\Omega \in (0, 1)$ and sampling rate $\rho$ consider the distortion of the linear minimum mean squared error estimator. Under the influence of sampling noise i.i.d. with zero mean and variance $\sigma_w^2$ the distortion is given by

$$D_{smp}(X) = \eta_{WS}(\rho\beta;\ \Omega/\rho), \tag{5.7}$$

where $\beta = \sigma_x^2/\sigma_w^2$ and $\eta_{WS}(\gamma; r)$ is given by Lemma B.4. Under the influence of observation noise i.i.d. with zero mean and variance $\sigma_w^2$ the distortion is given by

f  P 

1 +p 

p  +  Vfm  (1  _a(P  +  !);«>  i-n) 

0  <  p  <  1  —  n 

Dobs(X)  —  < 

P 

I  i+p 

)  +  ¥1™  (h2^  + 1);  h?.  A2 )' 

1  -  n  <  p  <  1 

(5.8) 

4/(1  +  P) 

1  <p 

where $\beta = \sigma_x^2/\sigma_w^2$ and $\eta_{FM}(\gamma; r_1, r_2)$ is given by Lemma B.5.

5.4  Discussion 

The distortions in Theorem 5.1 are shown in the following figures. Figure 5.2 shows $\log_{10} D$ as a function of $\rho$ for various $\Omega$, Figure 5.3 shows $\log_{10} D$ as a function of $\rho$ for various $\beta$, and Figure 5.4 shows $\log_{10} D$ as a function of $\Omega$ for various scalings of $\rho(\Omega)$.

The results are in line with our intuition. For $\rho < 1$ the noise in the samples due to observation error is correlated and is less detrimental to recovery. On the other hand, for $\rho > 1$, $D_{obs}$ remains constant whereas $D_{smp} \to 0$ as $\rho \to \infty$. Figure 5.3 shows that these results are consistent for a range of SNR. We remark that in all cases, the difference between the types of noise becomes very small as $\Omega$ becomes small. This occurs because $\eta_{FM}(\beta) \to \eta_{WS}(\beta)$ as the ratio $r_2$ of the F-matrix goes to zero.

Figure 5.2: The distortion, $\log_{10} D$, as a function of $\rho$ for various $\Omega$ and $\beta = 100$ under sampling noise (solid) and observation noise (dashed).

Figure 5.3: The distortion, $\log_{10} D$, as a function of $\rho$ for various $\beta$ and $\Omega = 0.4$ under sampling noise (solid) and observation noise (dashed).

Figure 5.4: The distortion, $\log_{10} D$, as a function of $\Omega$ for various linear scalings of $\rho$ and $\Omega$ for $\beta = 100$ under sampling noise (solid) and observation noise (dashed).

5.5  Proofs

The first step toward calculating the distortion is to recast the conditional distortion (5.5) in terms of a single random matrix. Under the sampling error model we have

$$\Sigma_x = \sigma_x^2 I_k, \qquad \Sigma_y = \sigma_x^2 \Phi_K \Phi_K^T + \sigma_w^2 I_m, \qquad \Sigma_{xy} = \sigma_x^2 \Phi_K^T,$$

and thus

$$D_{smp}(X|K, \Phi) = \frac{1}{k} \mathrm{tr}\{I_k - \Phi_K^T (\Phi_K \Phi_K^T + (1/\beta) I_m)^{-1} \Phi_K\} \tag{5.9}$$

$$= \frac{1}{k} \mathrm{tr}\{(I_k + \beta \Phi_K^T \Phi_K)^{-1}\}, \tag{5.10}$$

where $(n/m)\,\Phi_K^T \Phi_K$ is a central Wishart matrix.

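Under the observation error model we have

$$\Sigma_x = \sigma_x^2 I_k, \qquad \Sigma_y = \sigma_x^2 \Phi_K \Phi_K^T + \sigma_w^2 \Phi \Phi^T, \qquad \Sigma_{xy} = \sigma_x^2 \Phi_K^T,$$

and thus

$$D_{obs}(X|K, \Phi) = \frac{1}{k} \mathrm{tr}\{I_k - \Phi_K^T (\Phi_K \Phi_K^T + (1/\beta) \Phi \Phi^T)^{-1} \Phi_K\}. \tag{5.11}$$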

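To simplify the above equation we must consider whether or not the covariance matrix $\Sigma_{w'} = \Phi_{K^c} \Phi_{K^c}^T$ corresponding to $\Phi_{K^c} w_{obs,K^c}$ is invertible. For $m \le n - k$, the matrix $\Sigma_{w'}$ is full rank and the resulting conditional distortion is

$$D_{obs,1}(X|K, \Phi) = \frac{\beta}{1 + \beta} \left[ \frac{1}{\beta} + \frac{1}{k} \mathrm{tr}\{(I_k + (1 + \beta) \Phi_K^T \Sigma_{w'}^{-1} \Phi_K)^{-1}\} \right] \tag{5.12}$$

where a scaled version of $\Phi_K^T \Sigma_{w'}^{-1} \Phi_K$ is a central F matrix.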

For $n - k < m < n$, the matrix $\Sigma_{w'}$ has rank $n - k$ and is not invertible. The distortion can be determined using $\Sigma_{w''} = \Phi_{K^c}^T \Phi_{K^c}$, which has the same non-zero eigenvalues as $\Sigma_{w'}$, and an $(n - k) \times (n - m)$ random matrix $G$ whose i.i.d. elements have the same distribution as the elements of $\Phi$. The conditional distortion is equal in distribution to the following


D  A  P 

JJo  bs,2  — 


1  +  /3 


T  1  „ 
j3  +  kEtI 


Im-n  +  (1  +  P)G'rlG 


-1 


(5.13) 


where a scaled version of $G^T \Sigma_{w''}^{-1} G$ is a central F matrix. Finally, for $m \ge n$, the matrix $\Phi$ is invertible and the distortion is simply $1/(1 + \beta)$.
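The value $1/(1 + \beta)$ for the case of an invertible $\Phi$ can be checked numerically. When $\Phi$ is invertible, $\Phi^{-1} y_{obs} = x + w_{obs}$, so each non-zero element is observed in independent noise and the LMMSE estimate scales it by $\beta/(1 + \beta)$. The following is a minimal Monte Carlo sketch of this scalar problem, assuming Gaussian signal and noise; the function name `normalized_mse` is hypothetical.

```python
import random

def normalized_mse(beta, trials, rng):
    """Empirical E[(x - xhat)^2] / sigma_x^2 for a scalar observed in additive noise."""
    sigma_w = 1.0
    sigma_x = beta ** 0.5 * sigma_w      # so that beta = sigma_x^2 / sigma_w^2
    gain = beta / (1.0 + beta)           # LMMSE coefficient sigma_x^2 / (sigma_x^2 + sigma_w^2)
    err = 0.0
    for _ in range(trials):
        x = rng.gauss(0.0, sigma_x)
        z = x + rng.gauss(0.0, sigma_w)  # one element of x + w_obs
        err += (x - gain * z) ** 2
    return err / (trials * sigma_x ** 2)

rng = random.Random(1)
beta = 4.0
estimate = normalized_mse(beta, 200_000, rng)
predicted = 1.0 / (1.0 + beta)
```

With a fixed seed the empirical value should agree with `predicted` up to Monte Carlo error; the distortion does not depend on $\rho$ in this regime because additional samples cannot remove noise that is already part of the signal.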

The  next  step  involves  evaluating  the  expectations  of  the  conditional  distortions  given  in 
(5.10),  (5.12),  and  (5.13).  We  use  the  following  fact. 

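Lemma 5.1. For a nonnegative definite matrix $M \in \mathbb{R}^{n \times n}$ let $\lambda_1(M), \dots, \lambda_n(M)$ denote the eigenvalues of $M$ and let $F_M^n(x)$ denote the empirical eigenvalue distribution (B.1). For $\gamma > 0$ we have

$$\frac{1}{n} \mathrm{tr}\{(I + \gamma M)^{-1}\} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{1 + \gamma \lambda_i(M)} \tag{5.14}$$

$$= \int \frac{1}{1 + \gamma x}\, dF_M^n(x). \tag{5.15}$$
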

Proof.  This  follows  from  the  properties  of  the  trace. 


□ 
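As a small worked instance of the trace identity in Lemma 5.1, the sketch below compares $\mathrm{tr}\{(I + \gamma M)^{-1}\}$ computed by direct inversion with the sum of $1/(1 + \gamma \lambda_i(M))$ over the eigenvalues, for a symmetric $2 \times 2$ matrix where both sides have closed forms. The matrix entries and the function names are illustrative assumptions.

```python
import math

def trace_inverse_2x2(a, b, c, gamma):
    """tr{(I + gamma*M)^{-1}} for symmetric M = [[a, b], [b, c]], by direct inversion."""
    p, q, r = 1.0 + gamma * a, gamma * b, 1.0 + gamma * c
    det = p * r - q * q
    # The inverse of [[p, q], [q, r]] is (1/det) * [[r, -q], [-q, p]],
    # so its trace is (p + r) / det.
    return (p + r) / det

def eigenvalues_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius

def trace_from_spectrum(a, b, c, gamma):
    """Sum of 1 / (1 + gamma * lambda_i) over the eigenvalues."""
    return sum(1.0 / (1.0 + gamma * lam) for lam in eigenvalues_2x2(a, b, c))

a, b, c, gamma = 2.0, 1.0, 3.0, 0.7   # M is positive definite for these values
direct = trace_inverse_2x2(a, b, c, gamma)
spectral = trace_from_spectrum(a, b, c, gamma)
```

Dividing either quantity by the dimension gives the normalized form that is matched to the empirical eigenvalue distribution.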


We are interested in the asymptotic limits as $n, k, m \to \infty$ with $k/n \to \Omega$ and $m/n \to \rho$. In this setting, it has been shown that for both the Wishart [26] and F matrices [28] the empirical eigenvalue distributions converge to non-random continuous functions with closed form expressions which are given in Lemmas B.1 and B.2. This means that the conditional distortions (5.10), (5.12), and (5.13) converge to some non-random quantity. Moreover, this quantity, which corresponds to the $\eta$-transform (defined in Appendix B), has a closed form solution for the Wishart matrix and the F matrix. To conclude we note that as $n \to \infty$,


DSm.p(X\K,  $) 
DohB,i(X\K,  $) 

DohSt2(X\K,  $) 


hws  (p/3;  £) 


/3 


1  T  /3 

P 

1  +  /3 


~p + ^ 


n 


1  —  0 


(/3  +  l  P 


1  1  -  P 

p  + 


i-p 


1  -n 

l-n  l-n 


(/3  + 1); 


i-p 


p 


where $\eta_{WS}(\gamma; r)$ and $\eta_{FM}(\gamma; r_1, r_2)$ are given by Lemmas B.4 and B.5.


(5.16) 

(5.17) 

(5.18) 
Chapter 6

Conclusions  and  Future  Work 


Research  over  the  past  decade  has  changed  what  it  means  to  “sample”  a  signal.  The  field 
of  compressed  sensing  in  particular  has  focused  on  recovering  sparse  signals  from  a  small 
number  of  linear  projections.  In  this  thesis  we  have  given  fundamental  limits  on  what  can 
and  cannot  be  learned  about  an  unknown  signal  when  the  samples,  which  consist  of  randomly 
constructed linear projections, are corrupted by noise. Our contributions lie in two areas:
support  recovery  bounds  and  the  effects  of  different  noise  models  in  signal  estimation. 

For  the  task  of  support  recovery,  Chapter  3  established  that  perfect  recovery  cannot  be 
achieved  in  the  setting  of  linear  sparsity  unless  the  SNR  grows  without  bound  with  the 
signal  dimension.  This  result  is  significant  because  it  shows  that,  in  many  applications, 
perfect  support  recovery  is  not  attainable. 

Next,  we  considered  partial  support  recovery,  and  showed  that  it  is  possible  to  recover 
some  fraction  of  the  support.  Chapters  3  and  4  gave  complementary  necessary  and  sufficient 
conditions  on  the  number  of  samples  and  the  SNR  required  to  attain  a  desired  accuracy. 
The  results  of  Chapter  3  gave  an  upper  bound  on  the  fraction  of  the  support  that  can  be 
recovered  by  any  estimation  algorithm.  The  results  of  Chapter  4  gave  a  lower  bound  on  the 
fraction of the support that can be recovered using a particular maximum likelihood estimator.

Our results on partial support recovery quantify how much information about the support can be extracted from noisy linear projections. Unlike previous bounds which considered only perfect recovery, we are able to derive achievable results for support recovery in the under-sampled linear sparsity setting. Also, our results were developed in parallel for both stochastic and non-stochastic signal models.

Our  other  contribution  dealt  with  the  effects  of  different  noise  models  on  the  ability  to 
estimate  a  sparse  signal.  Chapter  5  compared  the  traditional  sampling  model,  where  noise  is 
added  independently  to  each  sample,  to  a  model  where  noise  is  added  to  the  signal  prior  to 
sampling.  We  used  results  on  the  asymptotic  spectrum  of  certain  random  matrices  to  derive 
closed form expressions of the mean squared error distortion for both models. Our results
showed  that  in  the  under-sampled  setting,  noise  added  prior  to  sampling  is  less  detrimental 
than  noise  added  to  the  samples.  Furthermore,  the  difference  between  the  two  types  of  noise 
decreases  as  the  unknown  signal  becomes  sparser. 

We recognize several directions for further work. One potentially useful extension of this thesis is to derive sufficient conditions, like those given in Chapter 4, that correspond to an efficient estimation algorithm. The main drawback of the ML decoder we analyzed is that it is computationally hard, and is thus not practical in many settings. One way this analysis might be achieved is to bound the performance of some efficient estimator with respect to
the  ML  estimator.  In  conjunction  with  the  results  of  Chapter  4,  such  an  approach  would 
yield  an  achievable  result  for  the  efficient  estimator. 

Next, we note that our primary motivation for using a Gaussian sampling matrix was that
it  simplified  our  analysis.  In  general,  however,  it  would  be  interesting  to  see  to  what  degree 
our  results  hold  for  other  matrix  constructions.  This  would  be  particularly  interesting  for 
matrices  which  arise  naturally  out  of  the  physics  of  the  sampling  process,  or  for  matrices 
that  are  themselves  sparse,  and  are  thus  easy  to  implement. 

Finally,  our  results  on  signal  estimation  in  Chapter  5  illustrate  the  impact  of  where  the 
noise  enters  into  the  samples.  A  natural  extension  is  to  see  what  effect  noise  added  prior  to 
sampling  has  on  the  ability  to  recover  some  fraction  of  the  support. 
Appendix A

Facts  about  Chi  Squared  Variables 


In this appendix we provide a number of useful facts about $\chi^2$ random variables and prove lemmas that are used throughout the thesis. We begin with some standard definitions.

Definition A.1. Given $d$ independent variables $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$, then

$$Z = \sum_{i=1}^{d} \left( \frac{X_i - \mu_i}{\sigma_i} \right)^2 \tag{A.1}$$

is a central $\chi^2(d)$ variable with $d$ degrees of freedom.

Remark A.1. The variable $Z \sim \chi^2(d)$ is non-negative, has mean $d$, and variance $2d$. For $z \ge 0$ its probability density function is given by

$$f_Z(z) = \frac{z^{d/2 - 1} e^{-z/2}}{2^{d/2}\, \Gamma(d/2)} \tag{A.2}$$

where $\Gamma(s) = \int_0^\infty t^{s-1} e^{-t}\, dt$ is the Gamma function.

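Definition A.2. Given $d$ independent variables $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$, then

$$Z = \sum_{i=1}^{d} \left( \frac{X_i}{\sigma_i} \right)^2 \tag{A.3}$$

is a non-central $\chi^2_{NC}(d, \nu)$ variable with $d$ degrees of freedom and non-centrality parameter

$$\nu = \sum_{i=1}^{d} \left( \frac{\mu_i}{\sigma_i} \right)^2. \tag{A.4}$$
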


Remark A.2. The variable $Z \sim \chi^2_{NC}(d, \nu)$ is non-negative, has mean $d + \nu$, and variance $2(d + 2\nu)$. Let $N \sim \mathrm{Poisson}(\nu/2)$; then $Z$ is equal in distribution to a central $\chi^2$ variable $Y \sim \chi^2(d + 2N)$. Accordingly, its probability density function can be written in terms of the central $\chi^2$ distribution as

$$f_Z(z) = \sum_{i=0}^{\infty} \frac{e^{-\nu/2} (\nu/2)^i}{i!}\, f_{\chi^2(d + 2i)}(z). \tag{A.5}$$

Next we will provide some useful properties. We start with large deviation bounds.


Lemma  A.l.  Given  Z  ~  y2(d)  and  any  e  >  0  the  following  bounds  hold 
Pr{Z  >  (1  +  e)d}  <  exp  f — ^-e2 


Pr  <  Z  < 


(1  +  e) 


7 1  .  d 

d  >  <  exp  —  - 


log(l  +  e)  - 


1  +  6 


(A. 6) 
(A. 7) 


Proof. This lemma follows from a Chernoff bound and the proof is outlined in [29]. Let $Z$ be a central $\chi^2$ variable with $d$ degrees of freedom. For any $\mu < 0$

$$\Pr(Z \le z) = \Pr(e^{\mu(Z - z)} \ge 1).$$

By Markov's inequality this means

$$\Pr(Z \le z) \le \inf_{\mu < 0} E\big[e^{\mu(Z - z)}\big] = \inf_{\mu < 0} (1 - 2\mu)^{-d/2} \exp(-\mu z) = \inf_{\mu < 0} \exp\Big( -\frac{d}{2} \ln(1 - 2\mu) - \mu z \Big). \tag{A.8}$$

The infimum occurs at $\mu = (1 - d/z)/2$ and plugging this value into (A.8) results in the lower tail bound (A.6). The well known upper tail bound (A.7) is proved in [29]. □

Lemma A.2. Given a $p$-dimensional random vector $X \sim \mathcal{N}(0, I)$ and any $p \times p$ orthonormal projection matrix $\Pi$ with rank $d$ and independent of $X$, then $\|\Pi X\|^2 \sim \chi^2(d)$.

Proof. We can write $\Pi = U^T \Lambda U$ where $\Lambda$ is a diagonal matrix with exactly $d$ ones and $p - d$ zeros and $U^T U = I$. Since $U$ is orthonormal, the vector $Y = UX$ is equal in distribution to $X$. Thus we have

$$\|\Pi X\|^2 = Y^T \Lambda Y = \sum_{i=1}^{p} \lambda_i Y_i^2 = \sum_{i\,:\,\lambda_i \ne 0} Y_i^2 \tag{A.9}$$

which is $\chi^2$ with $d$ degrees of freedom because the $Y_i$ are independently distributed $\mathcal{N}(0, 1)$. □
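Returning to the Chernoff argument in the proof of Lemma A.1, substituting $\mu = (1 - d/z)/2$ into $(1 - 2\mu)^{-d/2} \exp(-\mu z)$ gives $\Pr(Z \le z) \le (z/d)^{d/2} e^{(d - z)/2}$ for $0 < z < d$. The following is a minimal sketch that compares this bound with an empirical frequency; the parameter values and function names are illustrative assumptions, and the simulation only illustrates the inequality rather than proving it.

```python
import math
import random

def chernoff_lower_tail(d, z):
    """Upper bound on Pr(Z <= z) for Z ~ chi^2(d) and 0 < z < d.

    Evaluates (1 - 2*mu)^(-d/2) * exp(-mu*z) at the minimizer mu = (1 - d/z)/2.
    """
    mu = 0.5 * (1.0 - d / z)
    return (1.0 - 2.0 * mu) ** (-d / 2.0) * math.exp(-mu * z)

def empirical_lower_tail(d, z, trials, rng):
    """Fraction of simulated chi^2(d) draws that fall at or below z."""
    hits = 0
    for _ in range(trials):
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        hits += chi2 <= z
    return hits / trials

rng = random.Random(2)
d, z = 10, 5.0
bound = chernoff_lower_tail(d, z)
freq = empirical_lower_tail(d, z, 20_000, rng)
```

For these values the bound is loose by a constant factor, which is typical of Chernoff bounds: they capture the exponential rate in $d$ rather than the exact tail probability.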

Lemma A.3. The cumulative distribution function of a non-central $\chi^2_{NC}$ variable is decreasing in the non-centrality parameter $\nu$, that is, for any integer $d > 0$ and scalars $x$, $\nu' > \nu \ge 0$ we have

$$\Pr\{\chi^2_{NC}(d, \nu') \le x\} \le \Pr\{\chi^2_{NC}(d, \nu) \le x\}. \tag{A.10}$$

Proof. Let $Q(x; d, \nu) = \Pr\{\chi^2_{NC}(d, \nu) \le x\}$ and let $F(x; d) = \Pr\{\chi^2(d) \le x\}$. Note that $Q(x; d, \nu)$ can be written in terms of $F(x; \cdot)$ as

$$Q(x; d, \nu) = \sum_{j=0}^{\infty} e^{-\nu/2} \frac{(\nu/2)^j}{j!} F(x; d + 2j).$$

Taking the partial derivative with respect to $\nu$ gives

$$\frac{\partial}{\partial \nu} Q(x; d, \nu) = \frac{1}{2} \sum_{j=0}^{\infty} e^{-\nu/2} \frac{(\nu/2)^j}{j!} \big[ F(x; d + 2 + 2j) - F(x; d + 2j) \big].$$

Since $F(x; d)$ is strictly decreasing in $d$ the above quantity is strictly negative. Thus, $Q(x; d, \nu)$ is strictly decreasing in $\nu$. □
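The Poisson-mixture representation used in this proof is easy to evaluate numerically. The sketch below computes $Q(x; d, \nu)$ for even $d$, where the central $\chi^2$ distribution function has the closed form $1 - e^{-x/2} \sum_{i < d/2} (x/2)^i / i!$, and checks that the values decrease as $\nu$ grows. The restriction to even $d$, the truncation of the series, and the function names are assumptions made for this illustration.

```python
import math

def central_cdf_even(x, d):
    """Pr{chi^2(d) <= x} for even d > 0, using the finite-sum closed form."""
    assert d % 2 == 0 and d > 0
    half = x / 2.0
    term, total = 1.0, 0.0
    for i in range(d // 2):
        total += term             # term equals (x/2)^i / i!
        term *= half / (i + 1)
    return 1.0 - math.exp(-half) * total

def noncentral_cdf(x, d, nu, terms=200):
    """Q(x; d, nu) as a Poisson(nu/2) mixture of central CDFs, truncated after `terms` terms."""
    weight = math.exp(-nu / 2.0)  # Poisson probability of j = 0
    total = 0.0
    for j in range(terms):
        total += weight * central_cdf_even(x, d + 2 * j)
        weight *= (nu / 2.0) / (j + 1)
    return total

x, d = 6.0, 4
values = [noncentral_cdf(x, d, nu) for nu in (0.0, 1.0, 2.0, 5.0, 10.0)]
```

At $\nu = 0$ the mixture collapses to the central distribution function, which is the comparison used to obtain the bound following (4.20).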

The following lemma is a counterpart to a large deviation bound. Roughly speaking, it will be used to bound a non-central $\chi^2_{NC}(d, d^2)$ variable a distance of $d$ away from its mean.

Lemma A.4. Given any $\tau, \gamma_1 < \infty$ and $\gamma_2 > 0$ there exists a constant $c > 0$ and integer $M < \infty$ such that for all $t \in [0, \tau]$ and $m > M$ the probability density function of the non-central $\chi^2_{NC}$ variable $Z \sim \chi^2_{NC}(m, \gamma_1 m + \gamma_2 m^2)$ is bounded as

$$f_Z(E[Z] - t\, m) \ge \frac{c}{m}. \tag{A.11}$$



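Proof. Let $N \sim \mathrm{Poisson}((\gamma_1 m + \gamma_2 m^2)/2)$ and let $P = m + 2N$. Then, the pdf of $Z$ can be written as

$$f_Z(z) = E_P\left[ \frac{z^{P/2 - 1} e^{-z/2}}{2^{P/2}\, \Gamma(P/2)} \right] = E_P\left[ \frac{1}{2\, \Gamma(P/2)} \exp\Big\{ \Big( \frac{P}{2} - 1 \Big) \log\Big( \frac{z}{2} \Big) - \frac{z}{2} \Big\} \right]. \tag{A.12}$$
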

Using Stirling's approximation, $\Gamma(s) = \sqrt{2\pi}\, e^{-s + \nu/s} s^{s - 1/2}$ with $|\nu| < 1/12$, gives


fz(z)  >  E/ 


LV4t rP 

Using  the  bound  log(s)  <  s  —  1  gives 
fz{z)  >  Ep 

>  Ep 


P 


exP  1  ~  77  -  1  lo§  —  + 


P 


P-Z 


r  1  1 

Lf^-i 

l  V  2 

V4t rP  P  1 

^  pyn  < 

r  (^-^)2 

WA-nP  P 1 

[  2z 

P-z\  +  P  -  z 


1 

6 P 


1 

6 P 


1 

6 P 


Note that $E[P] = E[Z] = (1 + \gamma_1) m + \gamma_2 m^2$. Thus, with the substitution $z = E[Z] - t\, m$ we have


/z(E[Z]  -  im)  >  Ei 


.  vU tP 


exp 


(P  —  E[P]  +  tm)2  _ 
2(E[P]  —  tm)  6  P 
Using standard concentration results for Poisson random variables [30], there exists some constant $c_0 > 0$ and integer $M_0 < \infty$ such that, for all $m > M_0$, the following bound holds:

$$\Pr\{|P - E[P]| \le 2m\} \ge c_0. \tag{A.13}$$

Accordingly, for $m > M_0$ we have

$$f_Z(E[Z] - t\, m) \ge C_1 \exp\{-C_2\}$$


where 


Ci  = 
C2  = 


c0 


^/47r[(3  +  7i)m  +  7  2m2] 
(2  +  t)2m 


2[(1  +  7i  —  t)m  +  y2m2]  6[(1  +  71  )m  +  7  2m2 


Given $\tau$, we may choose some $M > M_0$ and constants $c_1 > 0$ and $c_2 < \infty$ such that for all $m > M$ we have

$$C_1 \ge \frac{c_1}{m}, \qquad C_2 \le c_2$$

for all $t \in [0, \tau]$. With $c = c_1 e^{-c_2}$ we conclude our proof.

□ 
Appendix B

The  Asymptotic  Spectrum  of  Random 
Matrices 


In  this  appendix  we  provide  results  about  the  limiting  eigenvalue  distribution  of  two  matrices 
that  arise  from  our  sampling  model.  With  the  exception  of  Lemma  B.5,  this  is  a  compilation 
of  known  results  (see  [31]  for  more  information).  In  general  these  results  hold  for  complex 
matrices,  but  for  simplicity  we  consider  only  real  matrices. 

We  begin  with  the  definitions  of  the  central  Wishart  and  central  F  matrices. 

Definition B.1. Let $H \in \mathbb{R}^{m \times k}$ have zero mean i.i.d. entries with variance one. The $k \times k$ matrix $(1/m) H^T H$ is a central Wishart matrix.

Definition B.2. Let $H \in \mathbb{R}^{m \times k}$ and $M \in \mathbb{R}^{m \times p}$ ($m < p$) have zero mean i.i.d. Gaussian

entries  with  unit  variance.  The  k  x  k  matrix  (p)HT((^)MMT)~1H  is  a  central  F  matrix. 

In  many  applications  in  signal  processing  and  communications,  the  relevant  performance 
metrics  only  depend  on  the  singular  values  of  the  matrix  M,  rather  than  on  its  more  precise 
structure.  Accordingly,  a  great  deal  of  research  has  focused  on  the  behavior  of  the  singular 
values  for  certain  random  matrices.  In  this  work,  we  need  the  distribution  of  the  eigenvalues 
(i.e.  the  spectrum)  for  the  Wishart  and  F  matrix.  For  an  n  x  n  matrix  M  the  empirical 
cumulative  distribution  of  the  eigenvalues  is  given  by 

$$F_M^n(x) = \frac{1}{n} \sum_{i=1}^{n} 1\{\lambda_i(M) \le x\} \tag{B.1}$$

where $\lambda_1(M), \dots, \lambda_n(M)$ are the eigenvalues of $M$ and $1\{\cdot\}$ is the indicator function. A remarkable fact about the matrices we are interested in is that their empirical distributions converge to a non-random continuous function as $n \to \infty$. For the Wishart matrix this is the classic Marčenko-Pastur law [26] and is given in Lemma B.1. For the F matrix this result was shown by Silverstein [28] and is given in Lemma B.2.

Lemma B.1. Let $H \in \mathbb{R}^{m \times k}$ have zero mean i.i.d. entries with variance one. If $k/m \to r$ as $k, m \to \infty$, then the empirical spectral density of the central Wishart matrix $(1/m) H^T H$ converges to

$$f_{WS}(x; r) = \Big(1 - \frac{1}{r}\Big)^{+} \delta(x) + \frac{\sqrt{(x - a)^{+} (b - x)^{+}}}{2 \pi r x} \tag{B.2}$$

with $a = (1 - \sqrt{r})^2$, $b = (1 + \sqrt{r})^2$ and $(x)^{+} = \max\{0, x\}$.

Lemma B.2. Let $H \in \mathbb{R}^{m \times k}$ and $M \in \mathbb{R}^{m \times p}$ ($m < p$) have zero mean i.i.d. Gaussian entries with unit variance. If $m/k \to r_1$ and $m/p \to r_2 \in (0, 1)$ as $k, m, p \to \infty$, then the empirical spectral density of the central F matrix converges to


ft  \  i  i  ,  1\+  xr  \  ,  (l-r2)y/(x-a)+(b-x)+ 

/™(x;  ri.  r2)  =  |  1  +  -  j  *(*)  + - 2^(riI  +  - 


(B.3) 


with 


a  = 


1  -  V1  ~  (1  -  r1)(l  -  r27 
1  -  r2 


6  = 


1  +  y/1  -  (1  -  r i)(l  -  r^J 
1  -  r2 


In  typical  applications  involving  a  random  matrix  M  the  properties  of  interest  depend  on 
the  expected  value  of  some  function  of  M.  The  following  two  transforms  are  taken  directly 
from  [31]  and  are  useful  for  many  problems  in  signal  processing  and  communications. 

Definition B.3. Given a nonnegative random variable $X$ and $\gamma > 0$ the Shannon transform is

$$\mathcal{V}_X(\gamma) = E[\log(1 + \gamma X)]. \tag{B.4}$$

Definition B.4. Given a nonnegative random variable $X$ and $\gamma > 0$ the $\eta$-transform is

$$\eta_X(\gamma) = E\left[ \frac{1}{1 + \gamma X} \right]. \tag{B.5}$$

The following lemmas provide the transforms necessary for this thesis. Lemmas B.3 and B.4 are taken from [31] whereas Lemma B.5 is, to our knowledge, the first closed form expression of the $\eta$-transform for the F-matrix.


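Lemma B.3. The Shannon transform of the distribution $f_{WS}(x; r)$ is given by

$$\mathcal{V}_{WS}(\gamma; r) = \log\Big(1 + \gamma - \tfrac{1}{4} \mathcal{F}(\gamma, r)\Big) + \frac{1}{r} \log\Big(1 + r\gamma - \tfrac{1}{4} \mathcal{F}(\gamma, r)\Big) - \frac{1}{4 r \gamma} \mathcal{F}(\gamma, r) \tag{B.6}$$

with

$$\mathcal{F}(\gamma, r) = \Big( \sqrt{\gamma (1 + \sqrt{r})^2 + 1} - \sqrt{\gamma (1 - \sqrt{r})^2 + 1} \Big)^2. \tag{B.7}$$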
Proof. The key to solving this nontrivial integration problem is to use the substitution $x = 1 + c - 2\sqrt{c} \cos u$ for a properly chosen value of $c$. □
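The substitution named in this proof also gives a convenient numerical check of the closed form. With $c = r$ and $0 < r < 1$, the map $x = 1 + r - 2\sqrt{r} \cos u$ for $u \in [0, \pi]$ turns the Marčenko-Pastur density times $dx$ into $2 \sin^2 u / (\pi x)\, du$, which is smooth and easy to integrate. The following is a minimal sketch, assuming natural logarithms and the form of the Shannon transform in which the final term is subtracted; the function names are hypothetical.

```python
import math

def F(gamma, r):
    """Auxiliary function F(gamma, r) from Lemma B.3."""
    s = math.sqrt(r)
    return (math.sqrt(gamma * (1 + s) ** 2 + 1)
            - math.sqrt(gamma * (1 - s) ** 2 + 1)) ** 2

def shannon_closed_form(gamma, r):
    """Closed-form Shannon transform (natural log, last term subtracted)."""
    f = F(gamma, r)
    return (math.log(1 + gamma - f / 4)
            + math.log(1 + r * gamma - f / 4) / r
            - f / (4 * r * gamma))

def shannon_numeric(gamma, r, steps=20000):
    """Midpoint-rule integral of log(1 + gamma*x) against the density, for 0 < r < 1.

    Uses x = 1 + r - 2*sqrt(r)*cos(u) with u in [0, pi]; there is no point
    mass at zero when r < 1.
    """
    s = math.sqrt(r)
    h = math.pi / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        x = 1 + r - 2 * s * math.cos(u)
        total += math.log(1 + gamma * x) * 2 * math.sin(u) ** 2 / (math.pi * x)
    return total * h

gamma, r = 3.0, 0.5
closed = shannon_closed_form(gamma, r)
numeric = shannon_numeric(gamma, r)
```

Agreement between `closed` and `numeric` at a few values of $\gamma$ and $r$ is a quick sanity check on the signs and constants in the closed form.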